  1. May 10, 2016
    • Marcelo Vanzin's avatar
      [SPARK-13670][LAUNCHER] Propagate error from launcher to shell. · 36c5892b
      Marcelo Vanzin authored
      bash doesn't really propagate errors from subshells when using redirection
      the way spark-class does; so, instead, this change captures the exit code
      of the launcher process in the command array, and checks it before executing
      the actual command.
      
      Tested by injecting an error in Main.java (the launcher entry point) and
      verifying the shell gets the right exit code from spark-class.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #12910 from vanzin/SPARK-13670.
      36c5892b
    • Holden Karau's avatar
      [SPARK-13382][DOCS][PYSPARK] Update pyspark testing notes in build docs · 488863d8
      Holden Karau authored
      ## What changes were proposed in this pull request?
      
      The current build documents don't specify that for PySpark tests we need to include Hive in the assembly otherwise the ORC tests fail.
      
      ## How was this patch tested?

      Manually built the docs locally. Ran the provided build command followed by the PySpark SQL tests.
      
      ![pyspark2](https://cloud.githubusercontent.com/assets/59893/13190008/8829cde4-d70f-11e5-8ff5-a88b7894d2ad.png)
      
      Author: Holden Karau <holden@us.ibm.com>
      
      Closes #11278 from holdenk/SPARK-13382-update-pyspark-testing-notes-r2.
      488863d8
    • Herman van Hovell's avatar
      [SPARK-14773] [SPARK-15179] [SQL] Fix SQL building and enable Hive tests · 26462653
      Herman van Hovell authored
      ## What changes were proposed in this pull request?
      This PR fixes SQL building for predicate subqueries and correlated scalar subqueries. It also enables most Hive subquery tests.
      
      ## How was this patch tested?
      Enabled new tests in HiveComparisionSuite.
      
      Author: Herman van Hovell <hvanhovell@questtec.nl>
      
      Closes #12988 from hvanhovell/SPARK-14773.
      26462653
    • Pete Robbins's avatar
      [SPARK-15154] [SQL] Change key types to Long in tests · 2dfb9cd1
      Pete Robbins authored
      ## What changes were proposed in this pull request?
      
      As reported in the JIRA, the two tests changed here use a key of type Integer where the Spark SQL code assumes the type is Long. This PR changes the tests to use the correct key types.
      
      ## How was this patch tested?
      
      Test builds were run on both big-endian and little-endian platforms.
      
      Author: Pete Robbins <robbinspg@gmail.com>
      
      Closes #13009 from robbinspg/HashedRelationSuiteFix.
      2dfb9cd1
    • Cheng Lian's avatar
      [SPARK-14127][SQL] "DESC <table>": Extracts schema information from table... · 8a12580d
      Cheng Lian authored
      [SPARK-14127][SQL] "DESC <table>": Extracts schema information from table properties for data source tables
      
      ## What changes were proposed in this pull request?
      
      This is a follow-up of #12934 and #12844. This PR adds a set of utility methods in `DDLUtils` to help extract schema information (user-defined schema, partition columns, and bucketing information) from data source table properties. These utility methods are then used in `DescribeTableCommand` to refine the output for data source tables. Before this PR, the aforementioned schema information was only shown as raw table properties, which are hard to read.
      
      Sample output:
      
      ```
      +----------------------------+---------------------------------------------------------+-------+
      |col_name                    |data_type                                                |comment|
      +----------------------------+---------------------------------------------------------+-------+
      |a                           |bigint                                                   |       |
      |b                           |bigint                                                   |       |
      |c                           |bigint                                                   |       |
      |d                           |bigint                                                   |       |
      |# Partition Information     |                                                         |       |
      |# col_name                  |                                                         |       |
      |d                           |                                                         |       |
      |                            |                                                         |       |
      |# Detailed Table Information|                                                         |       |
      |Database:                   |default                                                  |       |
      |Owner:                      |lian                                                     |       |
      |Create Time:                |Tue May 10 03:20:34 PDT 2016                             |       |
      |Last Access Time:           |Wed Dec 31 16:00:00 PST 1969                             |       |
      |Location:                   |file:/Users/lian/local/src/spark/workspace-a/target/...  |       |
      |Table Type:                 |MANAGED                                                  |       |
      |Table Parameters:           |                                                         |       |
      |  rawDataSize               |-1                                                       |       |
      |  numFiles                  |1                                                        |       |
      |  transient_lastDdlTime     |1462875634                                               |       |
      |  totalSize                 |684                                                      |       |
      |  spark.sql.sources.provider|parquet                                                  |       |
      |  EXTERNAL                  |FALSE                                                    |       |
      |  COLUMN_STATS_ACCURATE     |false                                                    |       |
      |  numRows                   |-1                                                       |       |
      |                            |                                                         |       |
      |# Storage Information       |                                                         |       |
      |SerDe Library:              |org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe       |       |
      |InputFormat:                |org.apache.hadoop.mapred.SequenceFileInputFormat         |       |
      |OutputFormat:               |org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat|       |
      |Compressed:                 |No                                                       |       |
      |Num Buckets:                |2                                                        |       |
      |Bucket Columns:             |[b]                                                      |       |
      |Sort Columns:               |[c]                                                      |       |
      |Storage Desc Parameters:    |                                                         |       |
      |  path                      |file:/Users/lian/local/src/spark/workspace-a/target/...  |       |
      |  serialization.format      |1                                                        |       |
      +----------------------------+---------------------------------------------------------+-------+
      ```
      
      ## How was this patch tested?
      
      Test cases are added in `HiveDDLSuite` to check command output.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #13025 from liancheng/spark-14127-extract-schema-info.
      8a12580d
    • jerryshao's avatar
      [SPARK-14963][YARN] Using recoveryPath if NM recovery is enabled · aab99d31
      jerryshao authored
      ## What changes were proposed in this pull request?
      
      From Hadoop 2.5+, the YARN NM supports NM recovery, which uses a recovery path for auxiliary services such as spark_shuffle and mapreduce_shuffle. This change uses that recovery path instead of the NM local dir when NM recovery is enabled.
      
      ## How was this patch tested?
      
      Unit test + local test.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #12994 from jerryshao/SPARK-14963.
      aab99d31
    • Sital Kedia's avatar
      [SPARK-14542][CORE] PipeRDD should allow configurable buffer size for… · a019e6ef
      Sital Kedia authored
      ## What changes were proposed in this pull request?
      
      Currently PipedRDD internally uses PrintWriter to write data to the stdin of the piped process, which by default uses a BufferedWriter with a buffer size of 8k. In our experiments, we have seen that an 8k buffer size is too small and the job spends a significant amount of CPU time in system calls copying the data. We should have a way to configure the buffer size for the writer.
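
      A minimal sketch of the idea, not the actual PipedRDD code; the buffer size value, variable names, and the "cat" command below are illustrative only:

      ```scala
      import java.io.{BufferedWriter, OutputStreamWriter, PrintWriter}

      // Hypothetical sketch: wrap the process stdin in a BufferedWriter with a configurable
      // size instead of relying on PrintWriter's default 8k buffer.
      val bufferSize = 64 * 1024  // would come from a new configuration property
      val process = new ProcessBuilder("cat").start()
      val out = new PrintWriter(
        new BufferedWriter(new OutputStreamWriter(process.getOutputStream), bufferSize))
      out.println("record")
      out.close()
      ```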
      
      ## How was this patch tested?
      Ran PipedRDDSuite tests.
      
      Author: Sital Kedia <skedia@fb.com>
      
      Closes #12309 from sitalkedia/bufferedPipedRDD.
      a019e6ef
    • gatorsmile's avatar
      [SPARK-15215][SQL] Fix Explain Parsing and Output · 57064726
      gatorsmile authored
      #### What changes were proposed in this pull request?
      This PR is to address a few existing issues in `EXPLAIN`:
      - The `EXPLAIN` options `LOGICAL | FORMATTED | EXTENDED | CODEGEN` should match zero or one time, not zero or more times; the parser does not allow users to use more than one option in a single command.
      - The option `LOGICAL` is not supported, so an exception is now issued when users specify it in the command.
      - The output of `EXPLAIN` contains a spurious empty line when the output of the analyzed plan is empty; this is removed. For example:
        ```
        == Parsed Logical Plan ==
        CreateTable CatalogTable(`t`,CatalogTableType(MANAGED),CatalogStorageFormat(None,Some(org.apache.hadoop.mapred.TextInputFormat),Some(org.apache.hadoop.hive.ql.io.  HiveIgnoreKeyTextOutputFormat),None,false,Map()),List(CatalogColumn(col,int,true,None)),List(),List(),List(),-1,,1462725171656,-1,Map(),None,None,None), false
      
        == Analyzed Logical Plan ==
      
        CreateTable CatalogTable(`t`,CatalogTableType(MANAGED),CatalogStorageFormat(None,Some(org.apache.hadoop.mapred.TextInputFormat),Some(org.apache.hadoop.hive.ql.io.  HiveIgnoreKeyTextOutputFormat),None,false,Map()),List(CatalogColumn(col,int,true,None)),List(),List(),List(),-1,,1462725171656,-1,Map(),None,None,None), false
      
        == Optimized Logical Plan ==
        CreateTable CatalogTable(`t`,CatalogTableType(MANAGED),CatalogStorageFormat(None,Some(org.apache.hadoop.mapred.TextInputFormat),Some(org.apache.hadoop.hive.ql.io.  HiveIgnoreKeyTextOutputFormat),None,false,Map()),List(CatalogColumn(col,int,true,None)),List(),List(),List(),-1,,1462725171656,-1,Map(),None,None,None), false
        ...
        ```
      
      #### How was this patch tested?
      Added and modified a few test cases
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #12991 from gatorsmile/explainCreateTable.
      57064726
  2. May 09, 2016
    • gatorsmile's avatar
      [SPARK-15187][SQL] Disallow Dropping Default Database · f4537917
      gatorsmile authored
      #### What changes were proposed in this pull request?
      In the Hive metastore, dropping the default database is not allowed. However, `InMemoryCatalog` allows it.

      This PR disallows users from dropping the default database.
      
      #### How was this patch tested?
      We already had a test case in HiveDDLSuite; now we also add the same one to DDLSuite.
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #12962 from gatorsmile/dropDefaultDB.
      f4537917
    • Reynold Xin's avatar
      [SPARK-15229][SQL] Make case sensitivity setting internal · 4b4344a8
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      Our case sensitivity support differs from what the ANSI SQL standard specifies. Postgres' behavior is that if an identifier is quoted, it is treated as case sensitive; otherwise it is folded to lowercase. We will likely need to revisit this in the future and change our behavior. For now, the safest change for Spark 2.0 is to make the case sensitivity option internal and discourage users from turning it on, effectively making Spark always case insensitive.
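
      For illustration only (assuming a spark-shell session), the flag in question is `spark.sql.caseSensitive`; after this change it is treated as an internal setting and turning it on is discouraged:

      ```scala
      // Discouraged: Spark is effectively case insensitive by default after this change.
      spark.conf.set("spark.sql.caseSensitive", "true")
      ```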
      
      ## How was this patch tested?
      N/A - a small config documentation change.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #13011 from rxin/SPARK-15229.
      4b4344a8
    • Andrew Or's avatar
      [SPARK-15234][SQL] Fix spark.catalog.listDatabases.show() · 8f932fb8
      Andrew Or authored
      ## What changes were proposed in this pull request?
      
      Before:
      ```
      scala> spark.catalog.listDatabases.show()
      +--------------------+-----------+-----------+
      |                name|description|locationUri|
      +--------------------+-----------+-----------+
      |Database[name='de...|
      |Database[name='my...|
      |Database[name='so...|
      +--------------------+-----------+-----------+
      ```
      
      After:
      ```
      +-------+--------------------+--------------------+
      |   name|         description|         locationUri|
      +-------+--------------------+--------------------+
      |default|Default Hive data...|file:/user/hive/w...|
      |  my_db|  This is a database|file:/Users/andre...|
      |some_db|                    |file:/private/var...|
      +-------+--------------------+--------------------+
      ```
      
      ## How was this patch tested?
      
      New test in `CatalogSuite`
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #13015 from andrewor14/catalog-show.
      8f932fb8
    • xin Wu's avatar
      [SPARK-15025][SQL] fix duplicate of PATH key in datasource table options · 980bba0d
      xin Wu authored
      ## What changes were proposed in this pull request?
      The issue is that when the user provides the path option with an uppercase "PATH" key, `options` contains the `PATH` key and falls into the non-external case in the following code in `createDataSourceTables.scala`, where a new "path" key is created with a default path.
      ```
      val optionsWithPath =
            if (!options.contains("path")) {
              isExternal = false
              options + ("path" -> sessionState.catalog.defaultTablePath(tableIdent))
            } else {
              options
            }
      ```
      So, before creating the Hive table, `serdeInfo.parameters` will contain both the "PATH" and "path" keys pointing to different directories, and the Hive table's dataLocation will contain the value of "path".

      The fix in this PR is to convert `options` in the code above to a `CaseInsensitiveMap` before checking whether it contains the "path" key.
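
      A simplified sketch of the idea (this is not the actual `CaseInsensitiveMap` implementation; the names below are illustrative): normalize the option keys before checking for "path", so "PATH" and "path" are treated the same:

      ```scala
      // Hypothetical user-supplied options with an uppercase key.
      val options = Map("PATH" -> "/some/user/supplied/dir")
      // Normalize keys so the lookup is case insensitive.
      val normalized = options.map { case (k, v) => k.toLowerCase -> v }
      val isExternal = normalized.contains("path")  // now true even though the key was "PATH"
      ```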
      
      ## How was this patch tested?
      A test case is added.
      
      Author: xin Wu <xinwu@us.ibm.com>
      
      Closes #12804 from xwu0226/SPARK-15025.
      980bba0d
    • Josh Rosen's avatar
      [SPARK-15209] Fix display of job descriptions with single quotes in web UI timeline · 3323d0f9
      Josh Rosen authored
      ## What changes were proposed in this pull request?
      
      This patch fixes an escaping bug in the Web UI's event timeline that caused Javascript errors when displaying timeline entries whose descriptions include single quotes.
      
      The original bug can be reproduced by running
      
      ```scala
      sc.setJobDescription("double quote: \" ")
      sc.parallelize(1 to 10).count()
      
      sc.setJobDescription("single quote: ' ")
      sc.parallelize(1 to 10).count()
      ```
      
      and then browsing to the driver UI. Previously, this resulted in an "Uncaught SyntaxError" because the single quote from the description was not escaped and ended up closing a Javascript string literal too early.
      
      The fix implemented here is to change the relevant Javascript to define its string literals using double-quotes. Our escaping logic already properly escapes double quotes in the description, so this is safe to do.
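
      A hedged sketch of the escaping idea (not Spark's actual timeline code; the helper name is made up): emit the description as a double-quoted JS string literal and escape embedded backslashes and double quotes, so an embedded single quote can no longer terminate the literal early:

      ```scala
      // Build a double-quoted JavaScript string literal from an arbitrary description.
      def toJsStringLiteral(description: String): String = {
        val escaped = description.replace("\\", "\\\\").replace("\"", "\\\"")
        "\"" + escaped + "\""
      }

      toJsStringLiteral("single quote: ' ")  // the single quote stays harmless inside double quotes
      ```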
      
      ## How was this patch tested?
      
      Tested manually in `spark-shell` using the following cases:
      
      ```scala
      sc.setJobDescription("double quote: \" ")
      sc.parallelize(1 to 10).count()
      
      sc.setJobDescription("single quote: ' ")
      sc.parallelize(1 to 10).count()
      
      sc.setJobDescription("ampersand: &")
      sc.parallelize(1 to 10).count()
      
      sc.setJobDescription("newline: \n text after newline ")
      sc.parallelize(1 to 10).count()
      
      sc.setJobDescription("carriage return: \r text after return ")
      sc.parallelize(1 to 10).count()
      ```
      
      /cc sarutak for review.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #12995 from JoshRosen/SPARK-15209.
      3323d0f9
    • Josh Rosen's avatar
      [SPARK-14972] Improve performance of JSON schema inference's compatibleType method · c3350cad
      Josh Rosen authored
      This patch improves the performance of `InferSchema.compatibleType` and `inferField`. The net result of this patch is a 6x speedup in local benchmarks running against cached data with a massive nested schema.
      
      The key idea is to remove unnecessary sorting in `compatibleType`'s `StructType` merging code. This code takes two structs, merges the fields with matching names, and copies over the unique fields, producing a new schema which is the union of the two structs' schemas. Previously, this code performed a very inefficient `groupBy()` to match up fields with the same name, but this is unnecessary because `inferField` already sorts structs' fields by name: since both lists of fields are sorted, we can simply merge them in a single pass.
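
      As a rough illustration of the single-pass merge (a simplified sketch with placeholder field and type names, not the actual `InferSchema` code):

      ```scala
      import scala.collection.mutable.ArrayBuffer

      case class Field(name: String, dataType: String)

      // Merge two lists of fields that are already sorted by name, combining fields with
      // matching names via `merge` and copying unique fields through, in a single pass.
      def mergeSortedFields(left: Seq[Field], right: Seq[Field])
                           (merge: (Field, Field) => Field): Seq[Field] = {
        val out = new ArrayBuffer[Field](left.size + right.size)
        var i = 0
        var j = 0
        while (i < left.size && j < right.size) {
          val cmp = left(i).name.compareTo(right(j).name)
          if (cmp == 0) { out += merge(left(i), right(j)); i += 1; j += 1 }
          else if (cmp < 0) { out += left(i); i += 1 }
          else { out += right(j); j += 1 }
        }
        out ++= left.drop(i)
        out ++= right.drop(j)
        out.toSeq
      }
      ```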
      
      This patch also speeds up the existing field sorting in `inferField`: the old sorting code allocated unnecessary intermediate collections, while the new code uses mutable collections and performs in-place sorting.
      
      I rewrote inefficient `equals()` implementations in `StructType` and `Metadata`, significantly reducing object allocations in those methods.
      
      Finally, I replaced a `treeAggregate` call with `fold`: I doubt that `treeAggregate` will benefit us very much because the schemas would have to be enormous to realize large savings in network traffic. Since most schemas are probably fairly small in serialized form, they should typically fit within a direct task result and therefore can be incrementally merged at the driver as individual tasks finish. This change eliminates an entire (short) scheduler stage.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #12750 from JoshRosen/schema-inference-speedups.
      c3350cad
    • Wenchen Fan's avatar
      [SPARK-15173][SQL] DataFrameWriter.insertInto should work with datasource table stored in hive · 2adb11f6
      Wenchen Fan authored
      When we parse `CREATE TABLE USING`, we should build a `CreateTableUsing` plan with `managedIfNoPath` set to true. Then we add the default table path to the options when writing it to Hive.
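
      A hypothetical spark-shell sketch of the scenario (table name illustrative): create a data source table without an explicit path, then insert into it; with this change the table gets a default managed path instead of failing:

      ```scala
      // Data source table created without a path; managedIfNoPath = true means a default
      // table path is added before the definition is written to the Hive metastore.
      spark.range(3).write.format("parquet").saveAsTable("t1")
      spark.range(3, 6).write.insertInto("t1")
      ```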
      
      new test in `SQLQuerySuite`
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #12949 from cloud-fan/bug.
      2adb11f6
    • Alex Bozarth's avatar
      [SPARK-10653][CORE] Remove unnecessary things from SparkEnv · c3e23bc0
      Alex Bozarth authored
      ## What changes were proposed in this pull request?
      
      Removed blockTransferService and sparkFilesDir from SparkEnv since they're rarely used and don't need to be stored in the env. Edited their few usages to accommodate the change.
      
      ## How was this patch tested?
      
      ran dev/run-tests locally
      
      Author: Alex Bozarth <ajbozart@us.ibm.com>
      
      Closes #12970 from ajbozarth/spark10653.
      c3e23bc0
    • Andrew Or's avatar
      [SPARK-15166][SQL] Move some hive-specific code from SparkSession · 7bf9b120
      Andrew Or authored
      ## What changes were proposed in this pull request?
      
      This also simplifies the code being moved.
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #12941 from andrewor14/move-code.
      7bf9b120
    • Zheng RuiFeng's avatar
      [SPARK-15210][SQL] Add missing @DeveloperApi annotation in sql.types · dfdcab00
      Zheng RuiFeng authored
      Add the `DeveloperApi` annotation for `AbstractDataType`, `MapType`, and `UserDefinedType`.
      
      local build
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #12982 from zhengruifeng/types_devapi.
      dfdcab00
    • mwws's avatar
      [SPARK-15220][UI] add hyperlink to running application and completed application · f8aca5b4
      mwws authored
      ## What changes were proposed in this pull request?
      Add hyperlinks to "running applications" and "completed applications" so users can jump to the application tables directly. In my environment I set up 1000+ workers, and it's painful to scroll down past the worker list.
      
      ## How was this patch tested?
      Manually tested.
      
      ![sceenshot](https://cloud.githubusercontent.com/assets/13216322/15105718/97e06768-15f6-11e6-809d-3574046751a9.png)
      
      Author: mwws <wei.mao@intel.com>
      
      Closes #12997 from mwws/SPARK_UI.
      f8aca5b4
    • jerryshao's avatar
      [MINOR][SQL] Enhance the exception message if checkpointLocation is not set · ee6a8d7e
      jerryshao authored
      Enhance the exception message when `checkpointLocation` is not set. Previously the message was:
      
      ```
      java.util.NoSuchElementException: None.get
        at scala.None$.get(Option.scala:347)
        at scala.None$.get(Option.scala:345)
        at org.apache.spark.sql.DataFrameWriter$$anonfun$8.apply(DataFrameWriter.scala:338)
        at org.apache.spark.sql.DataFrameWriter$$anonfun$8.apply(DataFrameWriter.scala:338)
        at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
        at scala.collection.AbstractMap.getOrElse(Map.scala:59)
        at org.apache.spark.sql.DataFrameWriter.startStream(DataFrameWriter.scala:337)
        at org.apache.spark.sql.DataFrameWriter.startStream(DataFrameWriter.scala:277)
        ... 48 elided
      ```
      
      This is not very meaningful, so this change makes it more specific.

      Verified locally.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #12998 from jerryshao/improve-exception-message.
      ee6a8d7e
    • Sean Owen's avatar
      [SPARK-15067][YARN] YARN executors are launched with fixed perm gen size · 6747171e
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Look for MaxPermSize arguments anywhere in an arg, to account for quoted args. See JIRA for discussion.
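
      A hedged sketch of the general idea (not the actual YARN client code; the option list is an example): look for MaxPermSize anywhere inside each argument, so settings embedded in quoted or combined arguments are also detected:

      ```scala
      // Example extra Java options, including a quoted argument that bundles several flags.
      val javaOpts = Seq("-Xmx2g", "\"-XX:MaxPermSize=256m -XX:+UseG1GC\"")
      // Only append a default MaxPermSize when the user has not already set one anywhere.
      val userSetPermGen = javaOpts.exists(_.contains("-XX:MaxPermSize"))
      ```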
      
      ## How was this patch tested?
      
      Jenkins tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #12985 from srowen/SPARK-15067.
      6747171e
    • Liang-Chi Hsieh's avatar
      [SPARK-15225][SQL] Replace SQLContext with SparkSession in Encoder documentation · e083db2e
      Liang-Chi Hsieh authored
      `Encoder`'s doc mentions `sqlContext.implicits._`. We should use `sparkSession.implicits._` instead now.
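
      An illustrative spark-shell style snippet reflecting the updated docs (app name and data are placeholders):

      ```scala
      import org.apache.spark.sql.SparkSession

      // Implicits now come from SparkSession rather than SQLContext.
      val sparkSession = SparkSession.builder().master("local").appName("encoder-doc").getOrCreate()
      import sparkSession.implicits._
      val ds = Seq(1, 2, 3).toDS()
      ```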
      
      Only doc update.
      
      Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
      
      Closes #13002 from viirya/encoder-doc.
      e083db2e
    • Philipp Hoffmann's avatar
      [SPARK-15223][DOCS] fix wrongly named config reference · 65b4ab28
      Philipp Hoffmann authored
      ## What changes were proposed in this pull request?
      
      The configuration setting `spark.executor.logs.rolling.size.maxBytes` was changed to `spark.executor.logs.rolling.maxSize` in 1.4 or so.
      
      This commit fixes a remaining reference to the old name in the documentation.
      
      Also the description for `spark.executor.logs.rolling.maxSize` was edited to clearly state that the unit for the size is bytes.
      
      ## How was this patch tested?
      
      no tests
      
      Author: Philipp Hoffmann <mail@philipphoffmann.de>
      
      Closes #13001 from philipphoffmann/patch-3.
      65b4ab28
    • hyukjinkwon's avatar
      [MINOR][DOCS] Remove remaining sqlContext in documentation at examples · 2992a215
      hyukjinkwon authored
      This PR removes `sqlContext` in examples. All actual usages were replaced in https://github.com/apache/spark/pull/12809, but some remain in comments.
      
      Manual style checking.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #13006 from HyukjinKwon/minor-docs.
      2992a215
    • Cheng Lian's avatar
      [SPARK-14127][SQL] Makes 'DESC [EXTENDED|FORMATTED] <table>' support data source tables · 671b382a
      Cheng Lian authored
      ## What changes were proposed in this pull request?
      
      This is a follow-up of PR #12844. It makes the newly updated `DescribeTableCommand` support data source tables.
      
      ## How was this patch tested?
      
      A test case is added to check `DESC [EXTENDED | FORMATTED] <table>` output.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #12934 from liancheng/spark-14127-desc-table-follow-up.
      671b382a
    • gatorsmile's avatar
      [SPARK-15199][SQL] Disallow Dropping Built-in Functions · b1e01fd5
      gatorsmile authored
      #### What changes were proposed in this pull request?
      As in Hive and major RDBMSs, built-in functions should not be allowed to be dropped. In the current implementation, users can drop built-in functions; however, after dropping them, users are unable to add them back.
      
      #### How was this patch tested?
      Added a test case.
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #12975 from gatorsmile/dropBuildInFunction.
      b1e01fd5
    • Wenchen Fan's avatar
      [SPARK-15093][SQL] create/delete/rename directory for InMemoryCatalog operations if needed · beb16ec5
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      The following operations now perform file system operations:
      
      1. CREATE DATABASE: create a dir
      2. DROP DATABASE: delete the dir
      3. CREATE TABLE: create a dir
      4. DROP TABLE: delete the dir
      5. RENAME TABLE: rename the dir
      6. CREATE PARTITIONS: create a dir
      7. RENAME PARTITIONS: rename the dir
      8. DROP PARTITIONS: drop the dir
      
      ## How was this patch tested?
      
      new tests in `ExternalCatalogSuite`
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #12871 from cloud-fan/catalog.
      beb16ec5
    • Yanbo Liang's avatar
      [MINOR] [SPARKR] Update data-manipulation.R to use native csv reader · ee3b1715
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      * Since Spark now supports a native CSV reader, it is no longer necessary to use the third-party ```spark-csv``` in ```examples/src/main/r/data-manipulation.R```. Meanwhile, remove all ```spark-csv``` usage in SparkR.
      * Running R applications through ```sparkR``` is not supported as of Spark 2.0, so we change to use ```./bin/spark-submit``` to run the example.
      
      ## How was this patch tested?
      Offline test.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #13005 from yanboliang/r-df-examples.
      ee3b1715
    • Ryan Blue's avatar
      [SPARK-14459][SQL] Detect relation partitioning and adjust the logical plan · 652bbb1b
      Ryan Blue authored
      ## What changes were proposed in this pull request?
      
      This detects a relation's partitioning and adds checks to the analyzer.
      If an InsertIntoTable node has no partitioning, it is replaced by the
      relation's partition scheme and input columns are correctly adjusted,
      placing the partition columns at the end in partition order. If an
      InsertIntoTable node has partitioning, it is checked against the table's
      reported partitions.
      
      These changes required adding a PartitionedRelation trait to the catalog
      interface because Hive's MetastoreRelation doesn't extend
      CatalogRelation.
      
      This commit also includes a fix to InsertIntoTable's resolved logic,
      which now detects that all expected columns are present, including
      dynamic partition columns. Previously, the number of expected columns
      was not checked and resolved was true if there were missing columns.
      
      ## How was this patch tested?
      
      This adds new tests to the InsertIntoTableSuite that are fixed by this PR.
      
      Author: Ryan Blue <blue@apache.org>
      
      Closes #12239 from rdblue/SPARK-14459-detect-hive-partitioning.
      652bbb1b
    • mwws's avatar
      [MINOR][TEST][STREAMING] make "testDir" able to be cleaned after test. · 16a503cf
      mwws authored
      It's a minor bug in a test case: `val testDir = null` will always stay `null` since it's immutable, so in the finally block nothing gets cleaned up. The second `testDir` variable created in the try block is only visible inside the try block.
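
      A simplified sketch of the bug pattern (paths and names are illustrative, not the actual test code):

      ```scala
      import java.io.File

      val testDir: File = null
      try {
        // This new val shadows the outer one; only this inner testDir points at a real directory.
        val testDir = new File(System.getProperty("java.io.tmpdir"), "stream-test")
        testDir.mkdirs()
        // ... the test works against the inner testDir ...
      } finally {
        // Refers to the outer, still-null testDir, so nothing is ever cleaned up.
        if (testDir != null) testDir.delete()
      }
      ```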
      
      ## How was this patch tested?
      Run existing test case and passed.
      
      Author: mwws <wei.mao@intel.com>
      
      Closes #12999 from mwws/SPARK_MINOR.
      16a503cf
    • dding3's avatar
      [SPARK-15172][ML] Explicitly tell user initial coefficients is ignored when... · a78fbfa6
      dding3 authored
      [SPARK-15172][ML] Explicitly tell the user that initial coefficients are ignored when a size mismatch happens in LogisticRegression
      
      ## What changes were proposed in this pull request?
      Explicitly tell the user that the initial coefficients are ignored if their size doesn't match the expected size in LogisticRegression.
      
      ## How was this patch tested?
      local build
      
      Author: dding3 <dingding@dingding-ubuntu.sh.intel.com>
      
      Closes #12948 from dding3/master.
      a78fbfa6
    • Holden Karau's avatar
      [SPARK-15136][PYSPARK][DOC] Fix links to sphinx style and add a default param doc note · 12fe2ecd
      Holden Karau authored
      ## What changes were proposed in this pull request?
      
      PyDoc links in ml are in a non-standard format. Switch to the standard Sphinx link format for better-formatted documentation. Also add a note about a default value in one place, and copy some extended docs from Scala for GBT.
      
      ## How was this patch tested?
      
      Built docs locally.
      
      Author: Holden Karau <holden@us.ibm.com>
      
      Closes #12918 from holdenk/SPARK-15137-linkify-pyspark-ml-classification.
      12fe2ecd
    • Yuhao Yang's avatar
      [SPARK-14814][MLLIB] API: Java compatibility, docs · 68abc1b4
      Yuhao Yang authored
      ## What changes were proposed in this pull request?
      jira: https://issues.apache.org/jira/browse/SPARK-14814
      Fix a Java compatibility function in MLlib's DecisionTreeModel. As synced in the JIRA, other compatibility issues don't need fixes.
      
      ## How was this patch tested?
      
      existing ut
      
      Author: Yuhao Yang <hhbyyh@gmail.com>
      
      Closes #12971 from hhbyyh/javacompatibility.
      68abc1b4
    • Liang-Chi Hsieh's avatar
      [SPARK-15211][SQL] Select features column from LibSVMRelation causes failure · 635ef407
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      We need to use `requiredSchema` in `LibSVMRelation` to project the required columns when loading data from this data source. Otherwise, when users try to select the `features` column, it causes a failure.
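
      A hypothetical spark-shell snippet illustrating the failure mode this PR fixes (the file path is a placeholder):

      ```scala
      // Selecting only the `features` column from the libsvm data source previously failed
      // because `requiredSchema` was not used to project the required columns.
      val df = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
      df.select("features").show()
      ```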
      
      ## How was this patch tested?
      `LibSVMRelationSuite`.
      
      Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
      
      Closes #12986 from viirya/fix-libsvmrelation.
      635ef407
    • gatorsmile's avatar
      [SPARK-15184][SQL] Fix Silent Removal of An Existent Temp Table by Rename Table · a59ab594
      gatorsmile authored
      #### What changes were proposed in this pull request?
      Currently, if we rename a temp table `Tab1` to another existing temp table `Tab2`, `Tab2` will be silently removed. This PR detects this case and issues an exception message.

      In addition, this PR also detects another issue in the rename table command: when the destination table identifier has a database name, it should not be ignored, since that might mean users are trying to rename a regular table.
      
      #### How was this patch tested?
      Added two related test cases
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #12959 from gatorsmile/rewriteTable.
      a59ab594
  3. May 08, 2016
    • gatorsmile's avatar
      [SPARK-15185][SQL] InMemoryCatalog: Silent Removal of an Existent... · e9131ec2
      gatorsmile authored
      [SPARK-15185][SQL] InMemoryCatalog: Silent Removal of an Existent Table/Function/Partitions by Rename
      
      #### What changes were proposed in this pull request?
      So far, in the implementation of InMemoryCatalog, we do not check whether the new/destination table/function/partition exists or not. Thus, we just silently remove the existing table/function/partition.
      
      This PR is to detect them and issue an appropriate exception.
      
      #### How was this patch tested?
      Added the related test cases. They also verify that HiveExternalCatalog detects these errors.
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #12960 from gatorsmile/renameInMemoryCatalog.
      e9131ec2
    • Sun Rui's avatar
      [SPARK-12479][SPARKR] sparkR collect on GroupedData throws R error "missing... · 454ba4d6
      Sun Rui authored
      [SPARK-12479][SPARKR] sparkR collect on GroupedData throws R error "missing value where TRUE/FALSE needed"
      
      ## What changes were proposed in this pull request?
      
      This PR is a workaround for NA handling in hash code computation.
      
      This PR is on behalf of paulomagalhaes whose PR is https://github.com/apache/spark/pull/10436
      
      ## How was this patch tested?
      SparkR unit tests.
      
      Author: Sun Rui <sunrui2016@gmail.com>
      Author: ray <ray@rays-MacBook-Air.local>
      
      Closes #12976 from sun-rui/SPARK-12479.
      454ba4d6
  4. May 07, 2016
    • Sandeep Singh's avatar
      [SPARK-15178][CORE] Remove LazyFileRegion instead use netty's DefaultFileRegion · 6e268b9e
      Sandeep Singh authored
      ## What changes were proposed in this pull request?
      Remove LazyFileRegion and use Netty's DefaultFileRegion instead. LazyFileRegion was created so that we didn't create a file descriptor before having to send the file.
      
      ## How was this patch tested?
      Existing tests
      
      Author: Sandeep Singh <sandeep@techaddict.me>
      
      Closes #12977 from techaddict/SPARK-15178.
      6e268b9e
    • Bryan Cutler's avatar
      [DOC][MINOR] Fixed minor errors in feature.ml user guide doc · 5d188a69
      Bryan Cutler authored
      ## What changes were proposed in this pull request?
      Fixed some minor errors found when reviewing feature.ml user guide
      
      ## How was this patch tested?
      built docs locally
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      
      Closes #12940 from BryanCutler/feature.ml-doc_fixes-DOCS-MINOR.
      5d188a69
    • Nick Pentreath's avatar
      [MINOR][ML][PYSPARK] ALS example cleanup · b0cafdb6
      Nick Pentreath authored
      Cleans up ALS examples by removing unnecessary casts to double for `rating` and `prediction` columns, since `RegressionEvaluator` now supports `Double` & `Float` input types.
      
      ## How was this patch tested?
      
      Manual compile and run with `run-example ml.ALSExample` and `spark-submit examples/src/main/python/ml/als_example.py`.
      
      Author: Nick Pentreath <nickp@za.ibm.com>
      
      Closes #12892 from MLnick/als-examples-cleanup.
      b0cafdb6