Skip to content
Snippets Groups Projects
  1. Nov 24, 2016
  2. Nov 23, 2016
    • Shixiong Zhu's avatar
      [SPARK-18510][SQL] Follow up to address comments in #15951 · 223fa218
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      This PR addressed the rest comments in #15951.
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #15997 from zsxwing/SPARK-18510-follow-up.
      223fa218
    • Burak Yavuz's avatar
      [SPARK-18510] Fix data corruption from inferred partition column dataTypes · 0d1bf2b6
      Burak Yavuz authored
      ## What changes were proposed in this pull request?
      
      ### The Issue
      
      If I specify my schema when doing
      ```scala
      spark.read
        .schema(someSchemaWherePartitionColumnsAreStrings)
      ```
      but if the partition inference can infer it as IntegerType or I assume LongType or DoubleType (basically fixed size types), then once UnsafeRows are generated, your data will be corrupted.
      
      ### Proposed solution
      
      The partition handling code path is kind of a mess. In my fix I'm probably adding to the mess, but at least trying to standardize the code path.
      
      The real issue is that a user that uses the `spark.read` code path can never clearly specify what the partition columns are. If you try to specify the fields in `schema`, we practically ignore what the user provides, and fall back to our inferred data types. What happens in the end is data corruption.
      
      My solution tries to fix this by always trying to infer partition columns the first time you specify the table. Once we find what the partition columns are, we try to find them in the user specified schema and use the dataType provided there, or fall back to the smallest common data type.
      
      We will ALWAYS append partition columns to the user's schema, even if they didn't ask for it. We will only use the data type they provided if they specified it. While this is confusing, this has been the behavior since Spark 1.6, and I didn't want to change this behavior in the QA period of Spark 2.1. We may revisit this decision later.
      
      A side effect of this PR is that we won't need https://github.com/apache/spark/pull/15942 if this PR goes in.
      
      ## How was this patch tested?
      
      Regression tests
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #15951 from brkyvz/partition-corruption.
      0d1bf2b6
    • Wenchen Fan's avatar
      [SPARK-18050][SQL] do not create default database if it already exists · f129ebcd
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      When we try to create the default database, we ask hive to do nothing if it already exists. However, Hive will log an error message instead of doing nothing, and the error message is quite annoying and confusing.
      
      In this PR, we only create default database if it doesn't exist.
      
      ## How was this patch tested?
      
      N/A
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #15993 from cloud-fan/default-db.
      f129ebcd
    • Reynold Xin's avatar
      [SPARK-18522][SQL] Explicit contract for column stats serialization · 70ad07a9
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      The current implementation of column stats uses the base64 encoding of the internal UnsafeRow format to persist statistics (in table properties in Hive metastore). This is an internal format that is not stable across different versions of Spark and should NOT be used for persistence. In addition, it would be better if statistics stored in the catalog is human readable.
      
      This pull request introduces the following changes:
      
      1. Created a single ColumnStat class to for all data types. All data types track the same set of statistics.
      2. Updated the implementation for stats collection to get rid of the dependency on internal data structures (e.g. InternalRow, or storing DateType as an int32). For example, previously dates were stored as a single integer, but are now stored as java.sql.Date. When we implement the next steps of CBO, we can add code to convert those back into internal types again.
      3. Documented clearly what JVM data types are being used to store what data.
      4. Defined a simple Map[String, String] interface for serializing and deserializing column stats into/from the catalog.
      5. Rearranged the method/function structure so it is more clear what the supported data types are, and also moved how stats are generated into ColumnStat class so they are easy to find.
      
      ## How was this patch tested?
      Removed most of the original test cases created for column statistics, and added three very simple ones to cover all the cases. The three test cases validate:
      1. Roundtrip serialization works.
      2. Behavior when analyzing non-existent column or unsupported data type column.
      3. Result for stats collection for all valid data types.
      
      Also moved parser related tests into a parser test suite and added an explicit serialization test for the Hive external catalog.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #15959 from rxin/SPARK-18522.
      70ad07a9
    • Reynold Xin's avatar
      [SPARK-18557] Downgrade confusing memory leak warning message · 9785ed40
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      TaskMemoryManager has a memory leak detector that gets called at task completion callback and checks whether any memory has not been released. If they are not released by the time the callback is invoked, TaskMemoryManager releases them.
      
      The current error message says something like the following:
      ```
      WARN  [Executor task launch worker-0]
      org.apache.spark.memory.TaskMemoryManager - leak 16.3 MB memory from
      org.apache.spark.unsafe.map.BytesToBytesMap33fb6a15
      In practice, there are multiple reasons why these can be triggered in the normal code path (e.g. limit, or task failures), and the fact that these messages are log means the "leak" is fixed by TaskMemoryManager.
      ```
      
      To not confuse users, this patch downgrade the message from warning to debug level, and avoids using the word "leak" since it is not actually a leak.
      
      ## How was this patch tested?
      N/A - this is a simple logging improvement.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #15989 from rxin/SPARK-18557.
      9785ed40
    • Wenchen Fan's avatar
      [SPARK-18053][SQL] compare unsafe and safe complex-type values correctly · 84284e8c
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      In Spark SQL, some expression may output safe format values, e.g. `CreateArray`, `CreateStruct`, `Cast`, etc. When we compare 2 values, we should be able to compare safe and unsafe formats.
      
      The `GreaterThan`, `LessThan`, etc. in Spark SQL already handles it, but the `EqualTo` doesn't. This PR fixes it.
      
      ## How was this patch tested?
      
      new unit test and regression test
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #15929 from cloud-fan/type-aware.
      84284e8c
    • Eric Liang's avatar
      [SPARK-18545][SQL] Verify number of hive client RPCs in PartitionedTablePerfStatsSuite · 85235ed6
      Eric Liang authored
      ## What changes were proposed in this pull request?
      
      This would help catch accidental O(n) calls to the hive client as in https://issues.apache.org/jira/browse/SPARK-18507
      
      ## How was this patch tested?
      
      Checked that the test fails before https://issues.apache.org/jira/browse/SPARK-18507 was patched. cc cloud-fan
      
      Author: Eric Liang <ekl@databricks.com>
      
      Closes #15985 from ericl/spark-18545.
      85235ed6
    • Sean Owen's avatar
      [SPARK-18073][DOCS][WIP] Migrate wiki to spark.apache.org web site · 7e0cd1d9
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Updates links to the wiki to links to the new location of content on spark.apache.org.
      
      ## How was this patch tested?
      
      Doc builds
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #15967 from srowen/SPARK-18073.1.
      Unverified
      7e0cd1d9
    • hyukjinkwon's avatar
      [SPARK-18179][SQL] Throws analysis exception with a proper message for... · 2559fb4b
      hyukjinkwon authored
      [SPARK-18179][SQL] Throws analysis exception with a proper message for unsupported argument types in reflect/java_method function
      
      ## What changes were proposed in this pull request?
      
      This PR proposes throwing an `AnalysisException` with a proper message rather than `NoSuchElementException` with the message ` key not found: TimestampType` when unsupported types are given to `reflect` and `java_method` functions.
      
      ```scala
      spark.range(1).selectExpr("reflect('java.lang.String', 'valueOf', cast('1990-01-01' as timestamp))")
      ```
      
      produces
      
      **Before**
      
      ```
      java.util.NoSuchElementException: key not found: TimestampType
        at scala.collection.MapLike$class.default(MapLike.scala:228)
        at scala.collection.AbstractMap.default(Map.scala:59)
        at scala.collection.MapLike$class.apply(MapLike.scala:141)
        at scala.collection.AbstractMap.apply(Map.scala:59)
        at org.apache.spark.sql.catalyst.expressions.CallMethodViaReflection$$anonfun$findMethod$1$$anonfun$apply$1.apply(CallMethodViaReflection.scala:159)
      ...
      ```
      
      **After**
      
      ```
      cannot resolve 'reflect('java.lang.String', 'valueOf', CAST('1990-01-01' AS TIMESTAMP))' due to data type mismatch: arguments from the third require boolean, byte, short, integer, long, float, double or string expressions; line 1 pos 0;
      'Project [unresolvedalias(reflect(java.lang.String, valueOf, cast(1990-01-01 as timestamp)), Some(<function1>))]
      +- Range (0, 1, step=1, splits=Some(2))
      ...
      ```
      
      Added message is,
      
      ```
      arguments from the third require boolean, byte, short, integer, long, float, double or string expressions
      ```
      
      ## How was this patch tested?
      
      Tests added in `CallMethodViaReflection`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #15694 from HyukjinKwon/SPARK-18179.
      2559fb4b
  3. Nov 22, 2016
    • Yanbo Liang's avatar
      [SPARK-18501][ML][SPARKR] Fix spark.glm errors when fitting on collinear data · 982b82e3
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      * Fix SparkR ```spark.glm``` errors when fitting on collinear data, since ```standard error of coefficients, t value and p value``` are not available in this condition.
      * Scala/Python GLM summary should throw exception if users get ```standard error of coefficients, t value and p value``` but the underlying WLS was solved by local "l-bfgs".
      
      ## How was this patch tested?
      Add unit tests.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #15930 from yanboliang/spark-18501.
      982b82e3
    • Shixiong Zhu's avatar
      [SPARK-18530][SS][KAFKA] Change Kafka timestamp column type to TimestampType · d0212eb0
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      Changed Kafka timestamp column type to TimestampType.
      
      ## How was this patch tested?
      
      `test("Kafka column types")`.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #15969 from zsxwing/SPARK-18530.
      d0212eb0
    • Dilip Biswal's avatar
      [SPARK-18533] Raise correct error upon specification of schema for datasource... · 39a1d306
      Dilip Biswal authored
      [SPARK-18533] Raise correct error upon specification of schema for datasource tables created using CTAS
      
      ## What changes were proposed in this pull request?
      Fixes the inconsistency of error raised between data source and hive serde
      tables when schema is specified in CTAS scenario. In the process the grammar for
      create table (datasource) is simplified.
      
      **before:**
      ``` SQL
      spark-sql> create table t2 (c1 int, c2 int) using parquet as select * from t1;
      Error in query:
      mismatched input 'as' expecting {<EOF>, '.', 'OPTIONS', 'CLUSTERED', 'PARTITIONED'}(line 1, pos 64)
      
      == SQL ==
      create table t2 (c1 int, c2 int) using parquet as select * from t1
      ----------------------------------------------------------------^^^
      ```
      
      **After:**
      ```SQL
      spark-sql> create table t2 (c1 int, c2 int) using parquet as select * from t1
               > ;
      Error in query:
      Operation not allowed: Schema may not be specified in a Create Table As Select (CTAS) statement(line 1, pos 0)
      
      == SQL ==
      create table t2 (c1 int, c2 int) using parquet as select * from t1
      ^^^
      ```
      ## How was this patch tested?
      Added a new test in CreateTableAsSelectSuite
      
      Author: Dilip Biswal <dbiswal@us.ibm.com>
      
      Closes #15968 from dilipbiswal/ctas.
      39a1d306
    • gatorsmile's avatar
      [SPARK-16803][SQL] SaveAsTable does not work when target table is a Hive serde table · 9c42d4a7
      gatorsmile authored
      ### What changes were proposed in this pull request?
      
      In Spark 2.0, `SaveAsTable` does not work when the target table is a Hive serde table, but Spark 1.6 works.
      
      **Spark 1.6**
      
      ``` Scala
      scala> sql("create table sample.sample stored as SEQUENCEFILE as select 1 as key, 'abc' as value")
      res2: org.apache.spark.sql.DataFrame = []
      
      scala> val df = sql("select key, value as value from sample.sample")
      df: org.apache.spark.sql.DataFrame = [key: int, value: string]
      
      scala> df.write.mode("append").saveAsTable("sample.sample")
      
      scala> sql("select * from sample.sample").show()
      +---+-----+
      |key|value|
      +---+-----+
      |  1|  abc|
      |  1|  abc|
      +---+-----+
      ```
      
      **Spark 2.0**
      
      ``` Scala
      scala> df.write.mode("append").saveAsTable("sample.sample")
      org.apache.spark.sql.AnalysisException: Saving data in MetastoreRelation sample, sample
       is not supported.;
      ```
      
      So far, we do not plan to support it in Spark 2.1 due to the risk. Spark 1.6 works because it internally uses insertInto. But, if we change it back it will break the semantic of saveAsTable (this method uses by-name resolution instead of using by-position resolution used by insertInto). More extra changes are needed to support `hive` as a `format` in DataFrameWriter.
      
      Instead, users should use insertInto API. This PR corrects the error messages. Users can understand how to bypass it before we support it in a separate PR.
      ### How was this patch tested?
      
      Test cases are added
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #15926 from gatorsmile/saveAsTableFix5.
      9c42d4a7
    • Shixiong Zhu's avatar
      [SPARK-18373][SPARK-18529][SS][KAFKA] Make failOnDataLoss=false work with Spark jobs · 2fd101b2
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      This PR adds `CachedKafkaConsumer.getAndIgnoreLostData` to handle corner cases of `failOnDataLoss=false`.
      
      It also resolves [SPARK-18529](https://issues.apache.org/jira/browse/SPARK-18529) after refactoring codes: Timeout will throw a TimeoutException.
      
      ## How was this patch tested?
      
      Because I cannot find any way to manually control the Kafka server to clean up logs, it's impossible to write unit tests for each corner case. Therefore, I just created `test("stress test for failOnDataLoss=false")` which should cover most of corner cases.
      
      I also modified some existing tests to test for both `failOnDataLoss=false` and `failOnDataLoss=true` to make sure it doesn't break existing logic.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #15820 from zsxwing/failOnDataLoss.
      2fd101b2
    • Burak Yavuz's avatar
      [SPARK-18465] Add 'IF EXISTS' clause to 'UNCACHE' to not throw exceptions when table doesn't exist · bdc8153e
      Burak Yavuz authored
      ## What changes were proposed in this pull request?
      
      While this behavior is debatable, consider the following use case:
      ```sql
      UNCACHE TABLE foo;
      CACHE TABLE foo AS
      SELECT * FROM bar
      ```
      The command above fails the first time you run it. But I want to run the command above over and over again, and I don't want to change my code just for the first run of it.
      The issue is that subsequent `CACHE TABLE` commands do not overwrite the existing table.
      
      Now we can do:
      ```sql
      UNCACHE TABLE IF EXISTS foo;
      CACHE TABLE foo AS
      SELECT * FROM bar
      ```
      
      ## How was this patch tested?
      
      Unit tests
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #15896 from brkyvz/uncache.
      bdc8153e
    • Wenchen Fan's avatar
      [SPARK-18507][SQL] HiveExternalCatalog.listPartitions should only call getTable once · 702cd403
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      HiveExternalCatalog.listPartitions should only call `getTable` once, instead of calling it for every partitions.
      
      ## How was this patch tested?
      
      N/A
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #15978 from cloud-fan/perf.
      702cd403
    • Nattavut Sutyanyong's avatar
      [SPARK-18504][SQL] Scalar subquery with extra group by columns returning incorrect result · 45ea46b7
      Nattavut Sutyanyong authored
      ## What changes were proposed in this pull request?
      
      This PR blocks an incorrect result scenario in scalar subquery where there are GROUP BY column(s)
      that are not part of the correlated predicate(s).
      
      Example:
      // Incorrect result
      Seq(1).toDF("c1").createOrReplaceTempView("t1")
      Seq((1,1),(1,2)).toDF("c1","c2").createOrReplaceTempView("t2")
      sql("select (select sum(-1) from t2 where t1.c1=t2.c1 group by t2.c2) from t1").show
      
      // How can selecting a scalar subquery from a 1-row table return 2 rows?
      
      ## How was this patch tested?
      sql/test, catalyst/test
      new test case covering the reported problem is added to SubquerySuite.scala
      
      Author: Nattavut Sutyanyong <nsy.can@gmail.com>
      
      Closes #15936 from nsyca/scalarSubqueryIncorrect-1.
      45ea46b7
    • Wenchen Fan's avatar
      [SPARK-18519][SQL] map type can not be used in EqualTo · bb152cdf
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      Technically map type is not orderable, but can be used in equality comparison. However, due to the limitation of the current implementation, map type can't be used in equality comparison so that it can't be join key or grouping key.
      
      This PR makes this limitation explicit, to avoid wrong result.
      
      ## How was this patch tested?
      
      updated tests.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #15956 from cloud-fan/map-type.
      bb152cdf
    • hyukjinkwon's avatar
      [SPARK-18447][DOCS] Fix the markdown for `Note:`/`NOTE:`/`Note that` across... · 933a6548
      hyukjinkwon authored
      [SPARK-18447][DOCS] Fix the markdown for `Note:`/`NOTE:`/`Note that` across Python API documentation
      
      ## What changes were proposed in this pull request?
      
      It seems in Python, there are
      
      - `Note:`
      - `NOTE:`
      - `Note that`
      - `.. note::`
      
      This PR proposes to fix those to `.. note::` to be consistent.
      
      **Before**
      
      <img width="567" alt="2016-11-21 1 18 49" src="https://cloud.githubusercontent.com/assets/6477701/20464305/85144c86-af88-11e6-8ee9-90f584dd856c.png">
      
      <img width="617" alt="2016-11-21 12 42 43" src="https://cloud.githubusercontent.com/assets/6477701/20464263/27be5022-af88-11e6-8577-4bbca7cdf36c.png">
      
      **After**
      
      <img width="554" alt="2016-11-21 1 18 42" src="https://cloud.githubusercontent.com/assets/6477701/20464306/8fe48932-af88-11e6-83e1-fc3cbf74407d.png">
      
      <img width="628" alt="2016-11-21 12 42 51" src="https://cloud.githubusercontent.com/assets/6477701/20464264/2d3e156e-af88-11e6-93f3-cab8d8d02983.png">
      
      ## How was this patch tested?
      
      The notes were found via
      
      ```bash
      grep -r "Note: " .
      grep -r "NOTE: " .
      grep -r "Note that " .
      ```
      
      And then fixed one by one comparing with API documentation.
      
      After that, manually tested via `make html` under `./python/docs`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #15947 from HyukjinKwon/SPARK-18447.
      Unverified
      933a6548
    • hyukjinkwon's avatar
      [SPARK-18514][DOCS] Fix the markdown for `Note:`/`NOTE:`/`Note that` across R API documentation · 4922f9cd
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      It seems in R, there are
      
      - `Note:`
      - `NOTE:`
      - `Note that`
      
      This PR proposes to fix those to `Note:` to be consistent.
      
      **Before**
      
      ![2016-11-21 11 30 07](https://cloud.githubusercontent.com/assets/6477701/20468848/2f27b0fa-afde-11e6-89e3-993701269dbe.png)
      
      **After**
      
      ![2016-11-21 11 29 44](https://cloud.githubusercontent.com/assets/6477701/20468851/39469664-afde-11e6-9929-ad80be7fc405.png)
      
      ## How was this patch tested?
      
      The notes were found via
      
      ```bash
      grep -r "NOTE: " .
      grep -r "Note that " .
      ```
      
      And then fixed one by one comparing with API documentation.
      
      After that, manually tested via `sh create-docs.sh` under `./R`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #15952 from HyukjinKwon/SPARK-18514.
      Unverified
      4922f9cd
    • Yanbo Liang's avatar
      [SPARK-18444][SPARKR] SparkR running in yarn-cluster mode should not download Spark package. · acb97157
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      When running SparkR job in yarn-cluster mode, it will download Spark package from apache website which is not necessary.
      ```
      ./bin/spark-submit --master yarn-cluster ./examples/src/main/r/dataframe.R
      ```
      The following is output:
      ```
      Attaching package: ‘SparkR’
      
      The following objects are masked from ‘package:stats’:
      
          cov, filter, lag, na.omit, predict, sd, var, window
      
      The following objects are masked from ‘package:base’:
      
          as.data.frame, colnames, colnames<-, drop, endsWith, intersect,
          rank, rbind, sample, startsWith, subset, summary, transform, union
      
      Spark not found in SPARK_HOME:
      Spark not found in the cache directory. Installation will start.
      MirrorUrl not provided.
      Looking for preferred site from apache website...
      ......
      ```
      There's no ```SPARK_HOME``` in yarn-cluster mode since the R process is in a remote host of the yarn cluster rather than in the client host. The JVM comes up first and the R process then connects to it. So in such cases we should never have to download Spark as Spark is already running.
      
      ## How was this patch tested?
      Offline test.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #15888 from yanboliang/spark-18444.
      acb97157
  4. Nov 21, 2016
    • Liwei Lin's avatar
      [SPARK-18425][STRUCTURED STREAMING][TESTS] Test `CompactibleFileStreamLog` directly · ebeb0830
      Liwei Lin authored
      ## What changes were proposed in this pull request?
      
      Right now we are testing the most of `CompactibleFileStreamLog` in `FileStreamSinkLogSuite` (because `FileStreamSinkLog` once was the only subclass of `CompactibleFileStreamLog`, but now it's not the case any more).
      
      Let's refactor the tests so that `CompactibleFileStreamLog` is directly tested, making future changes (like https://github.com/apache/spark/pull/15828, https://github.com/apache/spark/pull/15827) to `CompactibleFileStreamLog` much easier to test and much easier to review.
      
      ## How was this patch tested?
      
      the PR itself is about tests
      
      Author: Liwei Lin <lwlin7@gmail.com>
      
      Closes #15870 from lw-lin/test-compact-1113.
      ebeb0830
    • Burak Yavuz's avatar
      [SPARK-18493] Add missing python APIs: withWatermark and checkpoint to dataframe · 97a8239a
      Burak Yavuz authored
      ## What changes were proposed in this pull request?
      
      This PR adds two of the newly added methods of `Dataset`s to Python:
      `withWatermark` and `checkpoint`
      
      ## How was this patch tested?
      
      Doc tests
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #15921 from brkyvz/py-watermark.
      97a8239a
    • hyukjinkwon's avatar
      [SPARK-17765][SQL] Support for writing out user-defined type in ORC datasource · a2d46477
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR adds the support for `UserDefinedType` when writing out instead of throwing `ClassCastException` in ORC data source.
      
      In more details, `OrcStruct` is being created based on string from`DataType.catalogString`. For user-defined type, it seems it returns `sqlType.simpleString` for `catalogString` by default[1]. However, during type-dispatching to match the output with the schema, it tries to cast to, for example, `StructType`[2].
      
      So, running the codes below (`MyDenseVector` was borrowed[3]) :
      
      ``` scala
      val data = Seq((1, new UDT.MyDenseVector(Array(0.25, 2.25, 4.25))))
      val udtDF = data.toDF("id", "vectors")
      udtDF.write.orc("/tmp/test.orc")
      ```
      
      ends up throwing an exception as below:
      
      ```
      java.lang.ClassCastException: org.apache.spark.sql.UDT$MyDenseVectorUDT cannot be cast to org.apache.spark.sql.types.ArrayType
          at org.apache.spark.sql.hive.HiveInspectors$class.wrapperFor(HiveInspectors.scala:381)
          at org.apache.spark.sql.hive.orc.OrcSerializer.wrapperFor(OrcFileFormat.scala:164)
      ...
      ```
      
      So, this PR uses `UserDefinedType.sqlType` during finding the correct converter when writing out in ORC data source.
      
      [1]https://github.com/apache/spark/blob/dfdcab00c7b6200c22883baa3ebc5818be09556f/sql/catalyst/src/main/scala/org/apache/spark/sql/types/UserDefinedType.scala#L95
      [2]https://github.com/apache/spark/blob/d2dc8c4a162834818190ffd82894522c524ca3e5/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveInspectors.scala#L326
      [3]https://github.com/apache/spark/blob/2bfed1a0c5be7d0718fd574a4dad90f4f6b44be7/sql/core/src/test/scala/org/apache/spark/sql/UserDefinedTypeSuite.scala#L38-L70
      ## How was this patch tested?
      
      Unit tests in `OrcQuerySuite`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #15361 from HyukjinKwon/SPARK-17765.
      a2d46477
    • Dongjoon Hyun's avatar
      [SPARK-18517][SQL] DROP TABLE IF EXISTS should not warn for non-existing tables · ddd02f50
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      Currently, `DROP TABLE IF EXISTS` shows warning for non-existing tables. However, it had better be quiet for this case by definition of the command.
      
      **BEFORE**
      ```scala
      scala> sql("DROP TABLE IF EXISTS nonexist")
      16/11/20 20:48:26 WARN DropTableCommand: org.apache.spark.sql.catalyst.analysis.NoSuchTableException: Table or view 'nonexist' not found in database 'default';
      ```
      
      **AFTER**
      ```scala
      scala> sql("DROP TABLE IF EXISTS nonexist")
      res0: org.apache.spark.sql.DataFrame = []
      ```
      
      ## How was this patch tested?
      
      Manual because this is related to the warning messages instead of exceptions.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #15953 from dongjoon-hyun/SPARK-18517.
      ddd02f50
    • Gabriel Huang's avatar
      [SPARK-18361][PYSPARK] Expose RDD localCheckpoint in PySpark · 70176871
      Gabriel Huang authored
      ## What changes were proposed in this pull request?
      
      Expose RDD's localCheckpoint() and associated functions in PySpark.
      
      ## How was this patch tested?
      
      I added a UnitTest in python/pyspark/tests.py which passes.
      
      I certify that this is my original work, and I license it to the project under the project's open source license.
      
      Gabriel HUANG
      Developer at Cardabel (http://cardabel.com/)
      
      Author: Gabriel Huang <gabi.xiaohuang@gmail.com>
      
      Closes #15811 from gabrielhuang/pyspark-localcheckpoint.
      70176871
    • Dongjoon Hyun's avatar
      [SPARK-18413][SQL] Add `maxConnections` JDBCOption · 07beb5d2
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR adds a new JDBCOption `maxConnections` which means the maximum number of simultaneous JDBC connections allowed. This option applies only to writing with coalesce operation if needed. It defaults to the number of partitions of RDD. Previously, SQL users cannot cannot control this while Scala/Java/Python users can use `coalesce` (or `repartition`) API.
      
      **Reported Scenario**
      
      For the following cases, the number of connections becomes 200 and database cannot handle all of them.
      
      ```sql
      CREATE OR REPLACE TEMPORARY VIEW resultview
      USING org.apache.spark.sql.jdbc
      OPTIONS (
        url "jdbc:oracle:thin:10.129.10.111:1521:BKDB",
        dbtable "result",
        user "HIVE",
        password "HIVE"
      );
      -- set spark.sql.shuffle.partitions=200
      INSERT OVERWRITE TABLE resultview SELECT g, count(1) AS COUNT FROM tnet.DT_LIVE_INFO GROUP BY g
      ```
      
      ## How was this patch tested?
      
      Manual. Do the followings and see Spark UI.
      
      **Step 1 (MySQL)**
      ```
      CREATE TABLE t1 (a INT);
      CREATE TABLE data (a INT);
      INSERT INTO data VALUES (1);
      INSERT INTO data VALUES (2);
      INSERT INTO data VALUES (3);
      ```
      
      **Step 2 (Spark)**
      ```scala
      SPARK_HOME=$PWD bin/spark-shell --driver-memory 4G --driver-class-path mysql-connector-java-5.1.40-bin.jar
      scala> sql("SET spark.sql.shuffle.partitions=3")
      scala> sql("CREATE OR REPLACE TEMPORARY VIEW data USING org.apache.spark.sql.jdbc OPTIONS (url 'jdbc:mysql://localhost:3306/t', dbtable 'data', user 'root', password '')")
      scala> sql("CREATE OR REPLACE TEMPORARY VIEW t1 USING org.apache.spark.sql.jdbc OPTIONS (url 'jdbc:mysql://localhost:3306/t', dbtable 't1', user 'root', password '', maxConnections '1')")
      scala> sql("INSERT OVERWRITE TABLE t1 SELECT a FROM data GROUP BY a")
      scala> sql("CREATE OR REPLACE TEMPORARY VIEW t1 USING org.apache.spark.sql.jdbc OPTIONS (url 'jdbc:mysql://localhost:3306/t', dbtable 't1', user 'root', password '', maxConnections '2')")
      scala> sql("INSERT OVERWRITE TABLE t1 SELECT a FROM data GROUP BY a")
      scala> sql("CREATE OR REPLACE TEMPORARY VIEW t1 USING org.apache.spark.sql.jdbc OPTIONS (url 'jdbc:mysql://localhost:3306/t', dbtable 't1', user 'root', password '', maxConnections '3')")
      scala> sql("INSERT OVERWRITE TABLE t1 SELECT a FROM data GROUP BY a")
      scala> sql("CREATE OR REPLACE TEMPORARY VIEW t1 USING org.apache.spark.sql.jdbc OPTIONS (url 'jdbc:mysql://localhost:3306/t', dbtable 't1', user 'root', password '', maxConnections '4')")
      scala> sql("INSERT OVERWRITE TABLE t1 SELECT a FROM data GROUP BY a")
      ```
      
      ![maxconnections](https://cloud.githubusercontent.com/assets/9700541/20287987/ed8409c2-aa84-11e6-8aab-ae28e63fe54d.png)
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #15868 from dongjoon-hyun/SPARK-18413.
      Unverified
      07beb5d2
    • Takuya UESHIN's avatar
      [SPARK-18398][SQL] Fix nullabilities of MapObjects and ExternalMapToCatalyst. · 9f262ae1
      Takuya UESHIN authored
      ## What changes were proposed in this pull request?
      
      The nullabilities of `MapObject` can be made more strict by relying on `inputObject.nullable` and `lambdaFunction.nullable`.
      
      Also `ExternalMapToCatalyst.dataType` can be made more strict by relying on `valueConverter.nullable`.
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #15840 from ueshin/issues/SPARK-18398.
      9f262ae1
    • sethah's avatar
      [SPARK-18282][ML][PYSPARK] Add python clustering summaries for GMM and BKM · e811fbf9
      sethah authored
      ## What changes were proposed in this pull request?
      
      Add model summary APIs for `GaussianMixtureModel` and `BisectingKMeansModel` in pyspark.
      
      ## How was this patch tested?
      
      Unit tests.
      
      Author: sethah <seth.hendrickson16@gmail.com>
      
      Closes #15777 from sethah/pyspark_cluster_summaries.
      e811fbf9
  5. Nov 20, 2016
    • Takuya UESHIN's avatar
      [SPARK-18467][SQL] Extracts method for preparing arguments from StaticInvoke,... · 65854797
      Takuya UESHIN authored
      [SPARK-18467][SQL] Extracts method for preparing arguments from StaticInvoke, Invoke and NewInstance and modify to short circuit if arguments have null when `needNullCheck == true`.
      
      ## What changes were proposed in this pull request?
      
      This pr extracts method for preparing arguments from `StaticInvoke`, `Invoke` and `NewInstance` and modify to short circuit if arguments have `null` when `propageteNull == true`.
      
      The steps are as follows:
      
      1. Introduce `InvokeLike` to extract common logic from `StaticInvoke`, `Invoke` and `NewInstance` to prepare arguments.
      `StaticInvoke` and `Invoke` had a risk to exceed 64kb JVM limit to prepare arguments but after this patch they can handle them because they share the preparing code of NewInstance, which handles the limit well.
      
      2. Remove unneeded null checking and fix nullability of `NewInstance`.
      Avoid some of nullabilty checking which are not needed because the expression is not nullable.
      
      3. Modify to short circuit if arguments have `null` when `needNullCheck == true`.
      If `needNullCheck == true`, preparing arguments can be skipped if we found one of them is `null`, so modified to short circuit in the case.
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #15901 from ueshin/issues/SPARK-18467.
      65854797
    • Reynold Xin's avatar
      [HOTFIX][SQL] Fix DDLSuite failure. · b625a36e
      Reynold Xin authored
      b625a36e
    • Reynold Xin's avatar
      Fix Mesos build break for Scala 2.10. · 6659ae55
      Reynold Xin authored
      6659ae55
    • hyukjinkwon's avatar
      [SPARK-3359][BUILD][DOCS] Print examples and disable group and tparam tags in javadoc · c528812c
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes/fixes two things.
      
      - Remove many errors to generate javadoc with Java8 from unrecognisable tags, `tparam` and `group`.
      
        ```
        [error] .../spark/mllib/target/java/org/apache/spark/ml/classification/Classifier.java:18: error: unknown tag: group
        [error]   /** group setParam */
        [error]       ^
        [error] .../spark/mllib/target/java/org/apache/spark/ml/classification/Classifier.java:8: error: unknown tag: tparam
        [error]  * tparam FeaturesType  Type of input features.  E.g., <code>Vector</code>
        [error]    ^
        ...
        ```
      
        It does not fully resolve the problem but remove many errors. It seems both `group` and `tparam` are unrecognisable in javadoc. It seems we can't print them pretty in javadoc in a way of `example` here because they appear differently (both examples can be found in http://spark.apache.org/docs/2.0.2/api/scala/index.html#org.apache.spark.ml.classification.Classifier).
      
      - Print `example` in javadoc.
        Currently, there are few `example` tag in several places.
      
        ```
        ./graphx/src/main/scala/org/apache/spark/graphx/Graph.scala:   * example This operation might be used to evaluate a graph
        ./graphx/src/main/scala/org/apache/spark/graphx/Graph.scala:   * example We might use this operation to change the vertex values
        ./graphx/src/main/scala/org/apache/spark/graphx/Graph.scala:   * example This function might be used to initialize edge
        ./graphx/src/main/scala/org/apache/spark/graphx/Graph.scala:   * example This function might be used to initialize edge
        ./graphx/src/main/scala/org/apache/spark/graphx/Graph.scala:   * example This function might be used to initialize edge
        ./graphx/src/main/scala/org/apache/spark/graphx/Graph.scala:   * example We can use this function to compute the in-degree of each
        ./graphx/src/main/scala/org/apache/spark/graphx/Graph.scala:   * example This function is used to update the vertices with new values based on external data.
        ./graphx/src/main/scala/org/apache/spark/graphx/GraphLoader.scala:   * example Loads a file in the following format:
        ./graphx/src/main/scala/org/apache/spark/graphx/GraphOps.scala:   * example This function is used to update the vertices with new
        ./graphx/src/main/scala/org/apache/spark/graphx/GraphOps.scala:   * example This function can be used to filter the graph based on some property, without
        ./graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala: * example We can use the Pregel abstraction to implement PageRank:
        ./graphx/src/main/scala/org/apache/spark/graphx/VertexRDD.scala: * example Construct a `VertexRDD` from a plain RDD:
        ./repl/scala-2.10/src/main/scala/org/apache/spark/repl/SparkCommandLine.scala: * example new SparkCommandLine(Nil).settings
        ./repl/scala-2.10/src/main/scala/org/apache/spark/repl/SparkIMain.scala:   * example addImports("org.apache.spark.SparkContext")
        ./sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/LiteralGenerator.scala: * example {{{
        ```
      
      **Before**
      
        <img width="505" alt="2016-11-20 2 43 23" src="https://cloud.githubusercontent.com/assets/6477701/20457285/26f07e1c-aecb-11e6-9ae9-d9dee66845f4.png">
      
      **After**
        <img width="499" alt="2016-11-20 1 27 17" src="https://cloud.githubusercontent.com/assets/6477701/20457240/409124e4-aeca-11e6-9a91-0ba514148b52.png">
      
      ## How was this patch tested?
      
      Maunally tested by `jekyll build` with Java 7 and 8
      
      ```
      java version "1.7.0_80"
      Java(TM) SE Runtime Environment (build 1.7.0_80-b15)
      Java HotSpot(TM) 64-Bit Server VM (build 24.80-b11, mixed mode)
      ```
      
      ```
      java version "1.8.0_45"
      Java(TM) SE Runtime Environment (build 1.8.0_45-b14)
      Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)
      ```
      
      Note: this does not make sbt unidoc suceed with Java 8 yet but it reduces the number of errors with Java 8.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #15939 from HyukjinKwon/SPARK-3359-javadoc.
      Unverified
      c528812c
    • Herman van Hovell's avatar
      [SPARK-15214][SQL] Code-generation for Generate · 7ca7a635
      Herman van Hovell authored
      ## What changes were proposed in this pull request?
      
      This PR adds code generation to `Generate`. It supports two code paths:
      - General `TraversableOnce` based iteration. This used for regular `Generator` (code generation supporting) expressions. This code path expects the expression to return a `TraversableOnce[InternalRow]` and it will iterate over the returned collection. This PR adds code generation for the `stack` generator.
      - Specialized `ArrayData/MapData` based iteration. This is used for the `explode`, `posexplode` & `inline` functions and operates directly on the `ArrayData`/`MapData` result that the child of the generator returns.
      
      ### Benchmarks
      I have added some benchmarks and it seems we can create a nice speedup for explode:
      #### Environment
      ```
      Java HotSpot(TM) 64-Bit Server VM 1.8.0_92-b14 on Mac OS X 10.11.6
      Intel(R) Core(TM) i7-4980HQ CPU  2.80GHz
      ```
      #### Explode Array
      ##### Before
      ```
      generate explode array:                  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
      ------------------------------------------------------------------------------------------------
      generate explode array wholestage off         7377 / 7607          2.3         439.7       1.0X
      generate explode array wholestage on          6055 / 6086          2.8         360.9       1.2X
      ```
      ##### After
      ```
      generate explode array:                  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
      ------------------------------------------------------------------------------------------------
      generate explode array wholestage off         7432 / 7696          2.3         443.0       1.0X
      generate explode array wholestage on           631 /  646         26.6          37.6      11.8X
      ```
      #### Explode Map
      ##### Before
      ```
      generate explode map:                    Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
      ------------------------------------------------------------------------------------------------
      generate explode map wholestage off         12792 / 12848          1.3         762.5       1.0X
      generate explode map wholestage on          11181 / 11237          1.5         666.5       1.1X
      ```
      ##### After
      ```
      generate explode map:                    Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
      ------------------------------------------------------------------------------------------------
      generate explode map wholestage off         10949 / 10972          1.5         652.6       1.0X
      generate explode map wholestage on             870 /  913         19.3          51.9      12.6X
      ```
      #### Posexplode
      ##### Before
      ```
      generate posexplode array:               Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
      ------------------------------------------------------------------------------------------------
      generate posexplode array wholestage off      7547 / 7580          2.2         449.8       1.0X
      generate posexplode array wholestage on       5786 / 5838          2.9         344.9       1.3X
      ```
      ##### After
      ```
      generate posexplode array:               Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
      ------------------------------------------------------------------------------------------------
      generate posexplode array wholestage off      7535 / 7548          2.2         449.1       1.0X
      generate posexplode array wholestage on        620 /  624         27.1          37.0      12.1X
      ```
      #### Inline
      ##### Before
      ```
      generate inline array:                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
      ------------------------------------------------------------------------------------------------
      generate inline array wholestage off          6935 / 6978          2.4         413.3       1.0X
      generate inline array wholestage on           6360 / 6400          2.6         379.1       1.1X
      ```
      ##### After
      ```
      generate inline array:                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
      ------------------------------------------------------------------------------------------------
      generate inline array wholestage off          6940 / 6966          2.4         413.6       1.0X
      generate inline array wholestage on           1002 / 1012         16.7          59.7       6.9X
      ```
      #### Stack
      ##### Before
      ```
      generate stack:                          Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
      ------------------------------------------------------------------------------------------------
      generate stack wholestage off               12980 / 13104          1.3         773.7       1.0X
      generate stack wholestage on                11566 / 11580          1.5         689.4       1.1X
      ```
      ##### After
      ```
      generate stack:                          Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
      ------------------------------------------------------------------------------------------------
      generate stack wholestage off               12875 / 12949          1.3         767.4       1.0X
      generate stack wholestage on                   840 /  845         20.0          50.0      15.3X
      ```
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Herman van Hovell <hvanhovell@databricks.com>
      Author: Herman van Hovell <hvanhovell@questtec.nl>
      
      Closes #13065 from hvanhovell/SPARK-15214.
      7ca7a635
  6. Nov 19, 2016
    • Reynold Xin's avatar
      a64f25d8
    • Reynold Xin's avatar
      [SPARK-18508][SQL] Fix documentation error for DateDiff · bce9a036
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      The previous documentation and example for DateDiff was wrong.
      
      ## How was this patch tested?
      Doc only change.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #15937 from rxin/datediff-doc.
      bce9a036
    • Kazuaki Ishizaki's avatar
      [SPARK-18458][CORE] Fix signed integer overflow problem at an expression in RadixSort.java · d93b6552
      Kazuaki Ishizaki authored
      ## What changes were proposed in this pull request?
      
      This PR avoids that a result of an expression is negative due to signed integer overflow (e.g. 0x10?????? * 8 < 0). This PR casts each operand to `long` before executing a calculation. Since the result is interpreted as long, the result of the expression is positive.
      
      ## How was this patch tested?
      
      Manually executed query82 of TPC-DS with 100TB
      
      Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
      
      Closes #15907 from kiszk/SPARK-18458.
      d93b6552
    • sethah's avatar
      [SPARK-18456][ML][FOLLOWUP] Use matrix abstraction for coefficients in LogisticRegression training · 856e0042
      sethah authored
      ## What changes were proposed in this pull request?
      
      This is a follow up to some of the discussion [here](https://github.com/apache/spark/pull/15593). During LogisticRegression training, we store the coefficients combined with intercepts as a flat vector, but a more natural abstraction is a matrix. Here, we refactor the code to use matrix where possible, which makes the code more readable and greatly simplifies the indexing.
      
      Note: We do not use a Breeze matrix for the cost function as was mentioned in the linked PR. This is because LBFGS/OWLQN require an implicit `MutableInnerProductModule[DenseMatrix[Double], Double]` which is not natively defined in Breeze. We would need to extend Breeze in Spark to define it ourselves. Also, we do not modify the `regParamL1Fun` because OWLQN in Breeze requires a `MutableEnumeratedCoordinateField[(Int, Int), DenseVector[Double]]` (since we still use a dense vector for coefficients). Here again we would have to extend Breeze inside Spark.
      
      ## How was this patch tested?
      
      This is internal code refactoring - the current unit tests passing show us that the change did not break anything. No added functionality in this patch.
      
      Author: sethah <seth.hendrickson16@gmail.com>
      
      Closes #15893 from sethah/logreg_refactor.
      Unverified
      856e0042
    • Stavros Kontopoulos's avatar
      [SPARK-17062][MESOS] add conf option to mesos dispatcher · ea77c81e
      Stavros Kontopoulos authored
      Adds --conf option to set spark configuration properties in mesos dispacther.
      Properties provided with --conf take precedence over properties within the properties file.
      The reason for this PR is that for simple configuration or testing purposes we need to provide a property file (ideally a shared one for a cluster) even if we just provide a single property.
      
      Manually tested.
      
      Author: Stavros Kontopoulos <st.kontopoulos@gmail.com>
      Author: Stavros Kontopoulos <stavros.kontopoulos@lightbend.com>
      
      Closes #14650 from skonto/dipatcher_conf.
      ea77c81e
Loading