  1. Apr 06, 2016
• [SPARK-14424][BUILD][DOCS] Update the build docs to switch from assembly to package and add a no… · 457e58be
      Holden Karau authored
      ## What changes were proposed in this pull request?
      
Change our build docs & shell scripts so that developers are aware of the change from "assembly" to "package"
      
      ## How was this patch tested?
      
Manually ran ./bin/spark-shell after ./build/sbt assembly and verified that the error message was printed; then ran the newly suggested build target and verified that ./bin/spark-shell runs afterwards.
      
      Author: Holden Karau <holden@pigscanfly.ca>
      Author: Holden Karau <holden@us.ibm.com>
      
      Closes #12197 from holdenk/SPARK-1424-spark-class-broken-fix-build-docs.
• [SPARK-12133][STREAMING] Streaming dynamic allocation · 9af5423e
      Tathagata Das authored
      ## What changes were proposed in this pull request?
      
      Added a new Executor Allocation Manager for the Streaming scheduler for doing Streaming Dynamic Allocation.
      
## How was this patch tested?

Unit tests and cluster tests.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #12154 from tdas/streaming-dynamic-allocation.
• [SPARK-14391][LAUNCHER] Increase test timeouts. · de479260
      Marcelo Vanzin authored
Most of the time the tests should still pass really quickly; it's just
when machines are overloaded that they may take a little longer,
but that's still preferable to just failing the test.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #12210 from vanzin/SPARK-14391.
• [SPARK-14224] [SPARK-14223] [SPARK-14310] [SQL] fix RowEncoder and parquet reader for wide table · 5a4b11a9
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
1) Fix the RowEncoder for wide tables (many columns) by splitting the generated code into multiple functions.
2) Separate DataSourceScan into RowDataSourceScan and BatchedDataSourceScan.
3) Disable returning a columnar batch in the parquet reader if there are many columns.
4) Add an internal config for the maximum number of (nested) fields supported by whole stage codegen.
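The first fix splits the generated per-field code into many small functions instead of one huge method. A rough illustration of the chunking idea in plain Python (the `writeFields_N` helper names are invented for this sketch, not Spark's actual codegen):

```python
def split_into_functions(field_writers, max_fields_per_func=100):
    # Chunk the per-field statements so no single generated function
    # grows past a (hypothetical) limit of max_fields_per_func.
    chunks = [field_writers[i:i + max_fields_per_func]
              for i in range(0, len(field_writers), max_fields_per_func)]
    functions, calls = [], []
    for idx, chunk in enumerate(chunks):
        name = "writeFields_%d" % idx
        functions.append("private void %s(InternalRow row) { %s }"
                         % (name, " ".join(chunk)))
        calls.append("%s(row);" % name)
    return functions, calls

# 1000 columns become 10 small helpers plus 10 short call sites,
# instead of one method with 1000 statements.
writers = ["write(row, %d);" % i for i in range(1000)]
functions, calls = split_into_functions(writers)
```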
      
      Closes #12098
      
      ## How was this patch tested?
      
Added a test for a table with 1000 columns.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #12047 from davies/many_columns.
• [SPARK-14382][SQL] QueryProgress should be posted after committedOffsets is updated · a4ead6d3
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
Make sure QueryProgress is posted after committedOffsets is updated. If QueryProgress is posted before committedOffsets is updated, the listener may see a wrong sinkStatus (created from committedOffsets).
      
      See https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.2/644/testReport/junit/org.apache.spark.sql.util/ContinuousQueryListenerSuite/single_listener/ for an example of the failure.
      
      ## How was this patch tested?
      
      Existing unit tests.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #12155 from zsxwing/SPARK-14382.
• [SPARK-13430][PYSPARK][ML] Python API for training summaries of linear and logistic regression · 9c6556c5
      Bryan Cutler authored
      ## What changes were proposed in this pull request?
      
      Adding Python API for training summaries of LogisticRegression and LinearRegression in PySpark ML.
      
      ## How was this patch tested?
Added unit tests to exercise the API calls for the summary classes. Also manually verified that the values are as expected and match those from Scala directly.
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      
      Closes #11621 from BryanCutler/pyspark-ml-summary-SPARK-13430.
• [SPARK-14320][SQL] Make ColumnarBatch.Row mutable · bb1fa5b2
      Sameer Agarwal authored
      ## What changes were proposed in this pull request?
      
      In order to leverage a data structure like `AggregateHashMap` (https://github.com/apache/spark/pull/12055) to speed up aggregates with keys, we need to make `ColumnarBatch.Row` mutable.
      
      ## How was this patch tested?
      
      Unit test in `ColumnarBatchSuite`. Also, tested via `BenchmarkWholeStageCodegen`.
      
      Author: Sameer Agarwal <sameer@databricks.com>
      
      Closes #12103 from sameeragarwal/mutable-row.
• [SPARK-13538][ML] Add GaussianMixture to ML · af73d973
      Zheng RuiFeng authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-13538
      
      ## What changes were proposed in this pull request?
      
      Add GaussianMixture and GaussianMixtureModel to ML package
      
      ## How was this patch tested?
      
Unit tests and manual tests were done.
      Local Scalastyle checks passed.
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      Author: Ruifeng Zheng <ruifengz@foxmail.com>
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #11419 from zhengruifeng/mlgmm.
• [SPARK-14322][MLLIB] Use treeAggregate instead of reduce in OnlineLDAOptimizer · 8cffcb60
      Yuhao Yang authored
      ## What changes were proposed in this pull request?
      jira: https://issues.apache.org/jira/browse/SPARK-14322
      
OnlineLDAOptimizer uses RDD.reduce in two places where it could use treeAggregate, which can cause scalability issues; this should be an easy fix.
It is also a bug, because the reduce function modifies its first argument, so we should use aggregate or treeAggregate instead.
      See this line: https://github.com/apache/spark/blob/f12f11e578169b47e3f8b18b299948c0670ba585/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala#L452
      and a few lines below it.
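The mutation bug can be reproduced with plain Python's `reduce` (only an analogy; Spark's `RDD.reduce` has the same contract that the merge function must not mutate its arguments):

```python
from functools import reduce

data = [[1, 2], [3, 4], [5, 6]]

def bad_merge(a, b):
    # Mutates its left argument -- under RDD.reduce this can corrupt
    # the underlying partition data, which is the bug described above.
    for i, x in enumerate(b):
        a[i] += x
    return a

snapshot = [list(v) for v in data]
total = reduce(bad_merge, data)   # the total itself comes out right...
mutated = data != snapshot        # ...but the input data was modified!

def safe_merge(acc, b):
    # With aggregate/treeAggregate we supply our own zero value, so
    # mutating the accumulator is safe.
    for i, x in enumerate(b):
        acc[i] += x
    return acc

data2 = [[1, 2], [3, 4], [5, 6]]
total2 = reduce(safe_merge, data2, [0, 0])
```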
      
      ## How was this patch tested?
      unit tests
      
      Author: Yuhao Yang <hhbyyh@gmail.com>
      
      Closes #12106 from hhbyyh/ldaTreeReduce.
• [SPARK-13786][ML][PYSPARK] Add save/load for pyspark.ml.tuning · db0b06c6
      Xusen Yin authored
      ## What changes were proposed in this pull request?
      
      https://issues.apache.org/jira/browse/SPARK-13786
      
      Add save/load for Python CrossValidator/Model and TrainValidationSplit/Model.
      
      ## How was this patch tested?
      
      Test with Python doctest.
      
      Author: Xusen Yin <yinxusen@gmail.com>
      
      Closes #12020 from yinxusen/SPARK-13786.
• [SPARK-14383][SQL] missing "|" in the g4 file · 3c8d8821
      bomeng authored
      ## What changes were proposed in this pull request?
      
A very trivial one: the "|" was missing between DISTRIBUTE and UNSET.
      
      ## How was this patch tested?
      
      I do not think it is really needed.
      
      Author: bomeng <bmeng@us.ibm.com>
      
      Closes #12156 from bomeng/SPARK-14383.
• [SPARK-14429][SQL] Improve LIKE pattern in "SHOW TABLES / FUNCTIONS LIKE <pattern>" DDL · 5abd02c0
      bomeng authored
LIKE <pattern> is commonly used in SHOW TABLES / FUNCTIONS and similar DDL. In the pattern, users can use `|` or `*` as wildcards.

1. Currently, we used `replaceAll()` to replace `*` with `.*`, but the replacement was scattered across several places; I have created a utility method and use it in all of those places.

2. Consistency with Hive: in Hive the pattern is case insensitive and white space is trimmed, but the current pattern matching does neither. For example, given tables (t1, t2, t3), `SHOW TABLES LIKE ' T* ' ` will list all the t-tables. Please use Hive to verify it.

3. Combined with `|`, the result will be sorted. For a pattern like `'  B*|a*  '`, the result is listed in a-b order.

I've made some changes to the utility method to make sure we get the same result as Hive does.

A new method was created in StringUtil and test cases were added.
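The semantics described above can be sketched in plain Python (`filter_like` is a made-up name; the actual change lives in a Scala `StringUtil` helper):

```python
import re

def filter_like(names, pattern):
    # Trim and lowercase the pattern (the Hive-compatible behavior
    # described above), treat '*' as a wildcard and '|' as alternation,
    # then return the matches sorted.
    parts = [p.replace("*", ".*") for p in pattern.strip().lower().split("|")]
    return sorted({n for n in names
                   for p in parts if re.fullmatch(p, n.lower())})

tables = ["t1", "t2", "t3", "b1", "a1"]
t_only = filter_like(tables, "  T*  ")     # ['t1', 't2', 't3']
b_or_a = filter_like(tables, "  B*|a*  ")  # ['a1', 'b1'] -- a-b order
```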
      
      andrewor14
      
      Author: bomeng <bmeng@us.ibm.com>
      
      Closes #12206 from bomeng/SPARK-14429.
• [SPARK-14426][SQL] Merge ParserUtils and ParseUtils · 10494fea
      Kousuke Saruta authored
      ## What changes were proposed in this pull request?
      
We have `ParserUtils` and `ParseUtils`, which are both utility collections used during the parsing process.
Their names and purposes are so similar that I think we can merge them.

Also, the original `unescapeSQLString` method had a bug: when "\u0061"-style character literals are passed to it, they are not unescaped successfully.
This patch fixes that bug.
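The `\u0061`-style fix boils down to decoding `\uXXXX` escapes. A minimal Python sketch of just that part (a real SQL-string unescaper also handles `\n`, `\t`, octal escapes, and so on):

```python
import re

def unescape_unicode(s):
    # Replace each \uXXXX escape with the corresponding character.
    return re.sub(r"\\u([0-9a-fA-F]{4})",
                  lambda m: chr(int(m.group(1), 16)), s)

decoded = unescape_unicode(r"SELECT '\u0061\u0062c'")  # "SELECT 'abc'"
```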
      
      ## How was this patch tested?
      
      Added a new test case.
      
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #12199 from sarutak/merge-ParseUtils-and-ParserUtils.
• [SPARK-14418][PYSPARK] fix unpersist of Broadcast in Python · 90ca1844
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
Currently, Broadcast.unpersist() removes the broadcast's file, which should be the behavior of destroy().

This PR adds destroy() for Broadcast in Python, to match the semantics in Scala.
      
      ## How was this patch tested?
      
      Added regression tests.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #12189 from davies/py_unpersist.
• [SPARK-14288][SQL] Memory Sink for streaming · 59236e5c
      Michael Armbrust authored
This PR exposes the internal testing `MemorySink` through the data source API.  This will allow users to easily test streaming applications in the Spark shell or other local tests.
      
      Usage:
      ```scala
      inputStream.write
        .format("memory")
        .queryName("memStream")
        .startStream()
      
      // Now you can query the result of the stream here.
      sqlContext.table("memStream")
      ```
      
      The most complicated part of the logic is choosing the checkpoint directory.  There are a few requirements we are attempting to satisfy here:
       - when working in the shell locally, it should just work with no extra configuration.
       - when working on a cluster you should be able to make it easily create the checkpoint on a distributed file system so you can test aggregation (state checkpoints are also stored in this directory and must be accessible from workers).
       - it should be clear that you can't resume since the data is just in memory.
      
      The chosen algorithm proceeds as follows:
 - if the user gives a checkpoint directory, use it
       - if the conf has a checkpoint location, use `$location/$queryName`
       - if neither, create a local directory
       - always check to make sure there are no offsets written to the directory
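That resolution order might be sketched as follows (function and argument names are invented for illustration, not the PR's actual API):

```python
import os
import tempfile

def resolve_checkpoint_dir(user_dir=None, conf_location=None,
                           query_name="memStream"):
    if user_dir is not None:
        return user_dir                             # 1. explicit dir wins
    if conf_location is not None:
        return os.path.join(conf_location, query_name)  # 2. $location/$queryName
    return tempfile.mkdtemp(prefix="memory-sink-")  # 3. local fallback

explicit = resolve_checkpoint_dir(user_dir="/tmp/ckpt")
from_conf = resolve_checkpoint_dir(conf_location="/ckpts",
                                   query_name="memStream")
```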
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #12119 from marmbrus/memorySink.
      59236e5c
    • Prajwal Tuladhar's avatar
      [SPARK-14430][BUILD] use https while downloading binaries from build/mvn · 5e64dab8
      Prajwal Tuladhar authored
      ## What changes were proposed in this pull request?
      
The `./build/mvn` script was downloading binaries over plain HTTP. This PR fixes it to use HTTPS.
      
      ## How was this patch tested?
      
      By running `./build/mvn clean package` locally
      
      Author: Prajwal Tuladhar <praj@infynyxx.com>
      
      Closes #12182 from infynyxx/mvn_use_https.
      5e64dab8
    • Victor Chima's avatar
      Added omitted word in error message · 24015199
      Victor Chima authored
      ## What changes were proposed in this pull request?
      
      Added an omitted word in the error message displayed by the Graphx Pregel API when `maxIterations <= 0`
      
      ## How was this patch tested?
      
      Manual test
      
      Author: Victor Chima <blazy2k9@gmail.com>
      
      Closes #12205 from blazy2k9/hotfix/pregel-error-message.
      24015199
    • gatorsmile's avatar
      [SPARK-14396][BUILD][HOT] Fix compilation against Scala 2.10 · 25a4c8e0
      gatorsmile authored
      #### What changes were proposed in this pull request?
      This PR is to fix the compilation errors in Scala 2.10 build, as shown in the link:
      https://amplab.cs.berkeley.edu/jenkins/job/spark-master-compile-maven-scala-2.10/735/console
      ```
      [error] /home/jenkins/workspace/spark-master-compile-maven-scala-2.10/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveDDLCommandSuite.scala:266: value contains is not a member of Option[String]
      [error]     assert(desc.viewText.contains("SELECT * FROM tab1"))
      [error]                          ^
      [error] /home/jenkins/workspace/spark-master-compile-maven-scala-2.10/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveDDLCommandSuite.scala:267: value contains is not a member of Option[String]
      [error]     assert(desc.viewOriginalText.contains("SELECT * FROM tab1"))
      [error]                                  ^
      [error] /home/jenkins/workspace/spark-master-compile-maven-scala-2.10/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveDDLCommandSuite.scala:293: value contains is not a member of Option[String]
      [error]     assert(desc.viewText.contains("SELECT * FROM tab1"))
      [error]                          ^
      [error] /home/jenkins/workspace/spark-master-compile-maven-scala-2.10/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveDDLCommandSuite.scala:294: value contains is not a member of Option[String]
      [error]     assert(desc.viewOriginalText.contains("SELECT * FROM tab1"))
      [error]                                  ^
      [error] four errors found
      [error] Compile failed at Apr 5, 2016 10:59:09 PM [10.502s]
      ```
      
      #### How was this patch tested?
      Not sure how to trigger Scala 2.10 compilation in the test environment.
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #12201 from gatorsmile/buildBreak2.10.
• [SPARK-14252] Executors do not try to download remote cached blocks · 78c1076d
      Eric Liang authored
      ## What changes were proposed in this pull request?
      
As mentioned in the ticket, this was because one get path in the refactored `BlockManager` did not check for remote storage.
      
      ## How was this patch tested?
      
      Unit test, also verified manually with reproduction in the ticket.
      
      cc JoshRosen
      
      Author: Eric Liang <ekl@databricks.com>
      
      Closes #12193 from ericl/spark-14252.
• [SPARK-14396][SQL] Throw Exceptions for DDLs of Partitioned Views · 68be5b9e
      gatorsmile authored
      #### What changes were proposed in this pull request?
      
Because the concept of partitioning is associated with physical tables, we disable all support for partitioned views, which are defined by the following three commands in the [Hive DDL Manual](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Create/Drop/AlterView):
      ```
      ALTER VIEW view DROP [IF EXISTS] PARTITION spec1[, PARTITION spec2, ...];
      
      ALTER VIEW view ADD [IF NOT EXISTS] PARTITION spec;
      
      CREATE VIEW [IF NOT EXISTS] [db_name.]view_name [(column_name [COMMENT column_comment], ...) ]
        [COMMENT view_comment]
        [TBLPROPERTIES (property_name = property_value, ...)]
        AS SELECT ...;
      ```
      
      An exception is thrown when users issue any of these three DDL commands.
      
      #### How was this patch tested?
Added test cases for parsing CREATE VIEW and changed the existing test cases to verify that the exceptions are thrown.
      
      Author: gatorsmile <gatorsmile@gmail.com>
      Author: xiaoli <lixiao1983@gmail.com>
      Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
      
      Closes #12169 from gatorsmile/viewPartition.
• [SPARK-14416][CORE] Add thread-safe comments for CoarseGrainedSchedulerBackend's fields · 48467f4e
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
While reviewing #12078, I found that most of CoarseGrainedSchedulerBackend's mutable fields don't have any comments about their thread-safety assumptions, which makes it hard to figure out which parts of the code should be protected by the lock. This PR adds comments/annotations for them and also adds stricter access modifiers for some fields.
      
      ## How was this patch tested?
      
      Existing unit tests.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #12188 from zsxwing/comments.
  2. Apr 05, 2016
• [SPARK-14128][SQL] Alter table DDL followup · adbfdb87
      Andrew Or authored
      ## What changes were proposed in this pull request?
      
This is just a followup to #12121, which implemented the alter table DDLs using the `SessionCatalog`. Specifically, this corrects the behavior of setting the location of a datasource table. For datasource tables, we need to set the `locationUri` in addition to the `path` entry in the serde properties. Additionally, changing the location of a datasource table partition is not allowed.
      
      ## How was this patch tested?
      
      `DDLSuite`
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #12186 from andrewor14/alter-table-ddl-followup.
• [SPARK-14296][SQL] whole stage codegen support for Dataset.map · f6456fa8
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
This PR adds a new operator `MapElements` for `Dataset.map`; it is a 1-1 mapping and is easier to adapt to the whole stage codegen framework.
      
      ## How was this patch tested?
      
      new test in `WholeStageCodegenSuite`
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #12087 from cloud-fan/map.
• [SPARK-13211][STREAMING] StreamingContext throws NoSuchElementException when... · 8e5c1cbf
      Sean Owen authored
      [SPARK-13211][STREAMING] StreamingContext throws NoSuchElementException when created from non-existent checkpoint directory
      
      ## What changes were proposed in this pull request?
      
      Take 2: avoid None.get NoSuchElementException in favor of more descriptive IllegalArgumentException if a non-existent checkpoint dir is used without a SparkContext
      
      ## How was this patch tested?
      
      Jenkins test plus new test for this particular case
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #12174 from srowen/SPARK-13211.
• [SPARK-14359] Unit tests for java 8 lambda syntax with typed aggregates · 7d29c72f
      Eric Liang authored
      ## What changes were proposed in this pull request?
      
      Adds unit tests for java 8 lambda syntax with typed aggregates as a follow-up to #12168
      
      ## How was this patch tested?
      
      Unit tests.
      
      Author: Eric Liang <ekl@databricks.com>
      
      Closes #12181 from ericl/sc-2794-2.
• [SPARK-14353] Dataset Time Window `window` API for R · 1146c534
      Burak Yavuz authored
      ## What changes were proposed in this pull request?
      
      The `window` function was added to Dataset with [this PR](https://github.com/apache/spark/pull/12008).
      This PR adds the R API for this function.
      
With this PR, SQL, Java, and Scala will share the same APIs, in that users can use:
       - `window(timeColumn, windowDuration)`
       - `window(timeColumn, windowDuration, slideDuration)`
       - `window(timeColumn, windowDuration, slideDuration, startTime)`
      
      In Python and R, users can access all APIs above, but in addition they can do
       - In R:
         `window(timeColumn, windowDuration, startTime=...)`
      
      that is, they can provide the startTime without providing the `slideDuration`. In this case, we will generate tumbling windows.
      
      ## How was this patch tested?
      
      Unit tests + manual tests
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #12141 from brkyvz/R-windows.
• [HOTFIX] Fix `optional` to `createOptional`. · 48682f6b
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR fixes the following line.
      ```
         private[spark] val STAGING_DIR = ConfigBuilder("spark.yarn.stagingDir")
           .doc("Staging directory used while submitting applications.")
           .stringConf
      -    .optional
      +    .createOptional
      ```
      
      ## How was this patch tested?
      
      Pass the build.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #12187 from dongjoon-hyun/hotfix.
• [SPARK-529][SQL] Modify SQLConf to use new config API from core. · d5ee9d5c
      Marcelo Vanzin authored
      Because SQL keeps track of all known configs, some customization was
      needed in SQLConf to allow that, since the core API does not have that
      feature.
      
      Tested via existing (and slightly updated) unit tests.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #11570 from vanzin/SPARK-529-sql.
• [SPARK-14411][SQL] Add a note to warn that onQueryProgress is asynchronous · 7329fe27
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
onQueryProgress is asynchronous, so the user may see some future status of `ContinuousQuery`. This PR just updates the comments to warn about it.
      
      ## How was this patch tested?
      
      Only updated comments.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #12180 from zsxwing/ContinuousQueryListener-doc.
• [SPARK-14129][SPARK-14128][SQL] Alter table DDL commands · 45d8cdee
      Andrew Or authored
      ## What changes were proposed in this pull request?
      
      In Spark 2.0, we want to handle the most common `ALTER TABLE` commands ourselves instead of passing the entire query text to Hive. This is done using the new `SessionCatalog` API introduced recently.
      
      The commands supported in this patch include:
      ```
      ALTER TABLE ... RENAME TO ...
      ALTER TABLE ... SET TBLPROPERTIES ...
      ALTER TABLE ... UNSET TBLPROPERTIES ...
      ALTER TABLE ... SET LOCATION ...
      ALTER TABLE ... SET SERDE ...
      ```
      The commands we explicitly do not support are:
      ```
      ALTER TABLE ... CLUSTERED BY ...
      ALTER TABLE ... SKEWED BY ...
      ALTER TABLE ... NOT CLUSTERED
      ALTER TABLE ... NOT SORTED
      ALTER TABLE ... NOT SKEWED
      ALTER TABLE ... NOT STORED AS DIRECTORIES
      ```
      For these we throw exceptions complaining that they are not supported.
      
      ## How was this patch tested?
      
      `DDLSuite`
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #12121 from andrewor14/alter-table-ddl.
• [SPARK-14402][SQL] initcap UDF doesn't match Hive/Oracle behavior in lowercasing rest of string · c59abad0
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
Currently, Spark SQL's `initCap` uses the `toTitleCase` function. However, the `UTF8String.toTitleCase` implementation changes only the first letter and copies the other letters unchanged: e.g. sParK --> SParK. That is correct behavior for `toTitleCase` itself, but `initcap` should also lowercase the rest of the string, as Hive does.
      ```
      hive> select initcap('sParK');
      Spark
      ```
      ```
      scala> sql("select initcap('sParK')").head
      res0: org.apache.spark.sql.Row = [SParK]
      ```
      
      This PR updates the implementation of `initcap` using `toLowerCase` and `toTitleCase`.
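In plain Python, the fixed behavior amounts to something like the following (a per-word sketch of the Hive/Oracle semantics, not Spark's actual code path):

```python
def initcap(s):
    # Uppercase the first letter of each space-separated word and
    # lowercase the rest, mirroring the Hive behavior shown above.
    return " ".join(w[:1].upper() + w[1:].lower() for w in s.split(" "))

fixed = initcap("sParK")  # 'Spark', matching Hive's initcap
```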
      
      ## How was this patch tested?
      
      Pass the Jenkins tests (including new testcase).
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #12175 from dongjoon-hyun/SPARK-14402.
• [SPARK-14353] Dataset Time Window `window` API for Python, and SQL · 9ee5c257
      Burak Yavuz authored
      ## What changes were proposed in this pull request?
      
      The `window` function was added to Dataset with [this PR](https://github.com/apache/spark/pull/12008).
This PR adds the Python and SQL APIs for this function.
      
With this PR, SQL, Java, and Scala will share the same APIs, in that users can use:
       - `window(timeColumn, windowDuration)`
       - `window(timeColumn, windowDuration, slideDuration)`
       - `window(timeColumn, windowDuration, slideDuration, startTime)`
      
      In Python, users can access all APIs above, but in addition they can do
       - In Python:
         `window(timeColumn, windowDuration, startTime=...)`
      
      that is, they can provide the startTime without providing the `slideDuration`. In this case, we will generate tumbling windows.
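The tumbling-window arithmetic can be illustrated in plain Python (timestamps and durations in seconds; the function name is invented for the sketch):

```python
def tumbling_window(ts, window_duration, start_time=0):
    # Map a timestamp to its tumbling window [start, end), where window
    # boundaries are offset by start_time.
    start = ts - ((ts - start_time) % window_duration)
    return (start, start + window_duration)

default_win = tumbling_window(125, 60)                 # (120, 180)
shifted_win = tumbling_window(125, 60, start_time=15)  # (75, 135)
```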
      
      ## How was this patch tested?
      
      Unit tests + manual tests
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #12136 from brkyvz/python-windows.
• [SPARK-14123][SPARK-14384][SQL] Handle CreateFunction/DropFunction · 72544d6f
      Yin Huai authored
      ## What changes were proposed in this pull request?
This PR implements the CreateFunction and DropFunction commands. Besides implementing these two commands, we also change how functions are managed. Here are the main changes.
* `FunctionRegistry` will be a container to store all function builders and it will not actively load any functions. Because of this change, we do not need to maintain a separate registry for HiveContext, so `HiveFunctionRegistry` is deleted.
* SessionCatalog takes care of loading a function if it is not in the `FunctionRegistry` but its metadata is stored in the external catalog. In this case, SessionCatalog will (1) load the metadata from the external catalog, (2) load all needed resources (i.e. jars and files), (3) create a function builder based on the function definition, and (4) register the function builder in the `FunctionRegistry`.
* An `UnresolvedGenerator` is created, so the parser will not need to call `FunctionRegistry` directly during parsing, which is not a good time to create a Hive UDTF. In the analysis phase, we will resolve `UnresolvedGenerator`.
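The lookup-then-load flow might be sketched like this (class and field names are hypothetical, not Spark's actual API):

```python
class SessionCatalogSketch:
    def __init__(self, registry, external_catalog):
        self.registry = registry          # name -> function builder
        self.external = external_catalog  # name -> persisted metadata

    def lookup_function(self, name):
        builder = self.registry.get(name)
        if builder is None:
            meta = self.external.get(name)     # (1) load the metadata
            if meta is None:
                raise ValueError("undefined function: %s" % name)
            # (2) loading resources (jars/files) would happen here
            builder = meta["builder"]          # (3) build from definition
            self.registry[name] = builder      # (4) register the builder
        return builder

catalog = SessionCatalogSketch({}, {"plus_one": {"builder": lambda x: x + 1}})
fn = catalog.lookup_function("plus_one")
```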
      
      This PR is based on viirya's https://github.com/apache/spark/pull/12036/
      
      ## How was this patch tested?
      Existing tests and new tests.
      
      ## TODOs
- [x] Self-review
- [x] Cleanup
- [x] More tests for create/drop functions (we need more tests for permanent functions).
- [ ] File JIRAs for all TODOs
- [x] Standardize the error message when a function does not exist.
      
      Author: Yin Huai <yhuai@databricks.com>
      Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
      
      Closes #12117 from yhuai/function.
• [SPARK-13063][YARN] Make the SPARK YARN STAGING DIR as configurable · bc36df12
      Devaraj K authored
      ## What changes were proposed in this pull request?
Made the Spark YARN staging directory configurable via the 'spark.yarn.staging-dir' configuration.
      
      ## How was this patch tested?
      
I have verified it manually by running applications on YARN: if 'spark.yarn.staging-dir' is configured, its value is used as the staging directory; otherwise the default is used, i.e. the user's home directory on the file system.
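The fallback behavior amounts to a simple config lookup (an illustrative sketch; the real change is in the YARN client code):

```python
def staging_dir(conf, user_home):
    # Use the configured staging dir when present, otherwise fall back
    # to the user's home directory on the cluster file system.
    return conf.get("spark.yarn.staging-dir", user_home)

configured = staging_dir({"spark.yarn.staging-dir": "/tmp/staging"},
                         "/user/alice")   # '/tmp/staging'
default = staging_dir({}, "/user/alice")  # '/user/alice'
```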
      
      Author: Devaraj K <devaraj@apache.org>
      
      Closes #12082 from devaraj-kavali/SPARK-13063.
• [SPARK-14257][SQL] Allow multiple continuous queries to be started from the same DataFrame · 463bac00
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      Make StreamingRelation store the closure to create the source in StreamExecution so that we can start multiple continuous queries from the same DataFrame.
      
      ## How was this patch tested?
      
      `test("DataFrame reuse")`
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #12049 from zsxwing/df-reuse.
• [SPARK-14345][SQL] Decouple deserializer expression resolution from ObjectOperator · f77f11c6
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
This PR decouples deserializer expression resolution from `ObjectOperator` so that we can use deserializer expressions in normal operators. This is needed by #12061 and #12067; I abstracted the logic out and put it in this PR to reduce code change in the future.
      
      ## How was this patch tested?
      
      existing tests.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #12131 from cloud-fan/separate.
• [SPARK-14397][WEBUI] <html> and <body> tags are nested in LogPage · e4bd5041
      Kousuke Saruta authored
      ## What changes were proposed in this pull request?
      
      In `LogPage`, the content to be rendered is defined as follows.
      
      ```
          val content =
            <html>
              <body>
                {linkToMaster}
                <div>
                  <div style="float:left; margin-right:10px">{backButton}</div>
                  <div style="float:left;">{range}</div>
                  <div style="float:right; margin-left:10px">{nextButton}</div>
                </div>
                <br />
                <div style="height:500px; overflow:auto; padding:5px;">
                  <pre>{logText}</pre>
                </div>
              </body>
            </html>
          UIUtils.basicSparkPage(content, logType + " log page for " + pageName)
      ```
      
As you can see, `<html>` and `<body>` tags will be rendered.

On the other hand, `UIUtils.basicSparkPage` also renders those tags, so they end up nested.
      
      ```
        def basicSparkPage(
            content: => Seq[Node],
            title: String,
            useDataTables: Boolean = false): Seq[Node] = {
          <html>
            <head>
              {commonHeaderNodes}
              {if (useDataTables) dataTablesHeaderNodes else Seq.empty}
              <title>{title}</title>
            </head>
            <body>
              <div class="container-fluid">
                <div class="row-fluid">
                  <div class="span12">
                    <h3 style="vertical-align: middle; display: inline-block;">
                      <a style="text-decoration: none" href={prependBaseUri("/")}>
                        <img src={prependBaseUri("/static/spark-logo-77x50px-hd.png")} />
                        <span class="version"
                              style="margin-right: 15px;">{org.apache.spark.SPARK_VERSION}</span>
                      </a>
                      {title}
                    </h3>
                  </div>
                </div>
                {content}
              </div>
            </body>
          </html>
        }
      ```
      
      These are the screen shots before this patch is applied.
      
      ![before1](https://cloud.githubusercontent.com/assets/4736016/14273236/03cbed8a-fb44-11e5-8786-bc1bfa4d3f8c.png)
      ![before2](https://cloud.githubusercontent.com/assets/4736016/14273237/03d1741c-fb44-11e5-9dee-ea93022033a6.png)
      
      And these are the ones after this patch is applied.
      
      ![after1](https://cloud.githubusercontent.com/assets/4736016/14273248/1b6a7d8a-fb44-11e5-8a3b-69964f3434f6.png)
      ![after2](https://cloud.githubusercontent.com/assets/4736016/14273249/1b6b9c38-fb44-11e5-9d6f-281d64c842e4.png)
      
      The appearance is unchanged, but the HTML source is: the page no longer contains nested `<html>` and `<body>` tags.
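The fix can be sketched as follows: `content` drops the outer `<html>`/`<body>` wrappers and lets `UIUtils.basicSparkPage` supply them. This is a minimal sketch based on the snippets quoted above; the outer `<div>` wrapper is an assumption, and the actual patch may differ.

```scala
// Sketch: LogPage content without <html>/<body>, since
// UIUtils.basicSparkPage already emits those tags itself.
val content =
  <div>
    {linkToMaster}
    <div>
      <div style="float:left; margin-right:10px">{backButton}</div>
      <div style="float:left;">{range}</div>
      <div style="float:right; margin-left:10px">{nextButton}</div>
    </div>
    <br />
    <div style="height:500px; overflow:auto; padding:5px;">
      <pre>{logText}</pre>
    </div>
  </div>
UIUtils.basicSparkPage(content, logType + " log page for " + pageName)
```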
      
      ## How was this patch tested?
      
      Manually ran some jobs on my standalone cluster and checked the WebUI.
      
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #12170 from sarutak/SPARK-14397.
      e4bd5041
    • Shally Sangal's avatar
      [SPARK-14284][ML] KMeansSummary deprecating size; adding clusterSizes · d3569015
      Shally Sangal authored
      ## What changes were proposed in this pull request?
      
      `KMeansSummary` class: deprecated `size` and added `clusterSizes`.
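The shape of such a deprecation might look like the sketch below. This is illustrative only, not the actual patch: the constructor signature, return type, and deprecation version string are assumptions.

```scala
// Illustrative sketch: the old accessor is kept but deprecated,
// delegating to the new, more descriptive name.
class KMeansSummary(val clusterSizes: Array[Long]) {
  @deprecated("Use clusterSizes instead.", "2.0.0")
  def size: Array[Long] = clusterSizes
}
```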
      
      Author: Shally Sangal <shallysangal@gmail.com>
      
      Closes #12084 from shallys/master.
      d3569015
    • gatorsmile's avatar
      [SPARK-14349][SQL] Issue Error Messages for Unsupported Operators/DML/DDL in SQL Context. · 78071736
      gatorsmile authored
      #### What changes were proposed in this pull request?
      
      Currently, confusing error messages are issued when Hive Context-only operations are used in SQL Context.
      
      For example,
      - When calling `Drop Table` in SQL Context, we got the following message:
      ```
      Expected exception org.apache.spark.sql.catalyst.parser.ParseException to be thrown, but java.lang.ClassCastException was thrown.
      ```
      
      - When calling `Script Transform` in SQL Context, we got the message:
      ```
      assertion failed: No plan for ScriptTransformation [key#9,value#10], cat, [tKey#155,tValue#156], null
      +- LogicalRDD [key#9,value#10], MapPartitionsRDD[3] at beforeAll at BeforeAndAfterAll.scala:187
      ```
      
      Updates:
      Based on the investigation by hvanhovell, the root cause is the default implementation of `visitChildren`, which always returns the result of the last defined context child. After merging the code changes from hvanhovell, it works. Thank you, hvanhovell!
      
      #### How was this patch tested?
      A few test cases are added.
      
      Not sure if the same issue exists for the other operators/DDL/DML. hvanhovell
      
      Author: gatorsmile <gatorsmile@gmail.com>
      Author: xiaoli <lixiao1983@gmail.com>
      Author: Herman van Hovell <hvanhovell@questtec.nl>
      Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
      
      Closes #12134 from gatorsmile/hiveParserCommand.
      78071736
    • Dilip Biswal's avatar
      [SPARK-14348][SQL] Support native execution of SHOW TBLPROPERTIES command · 2715bc68
      Dilip Biswal authored
      ## What changes were proposed in this pull request?
      
      This PR adds native execution of the SHOW TBLPROPERTIES command.
      
      Command Syntax:
      ``` SQL
      SHOW TBLPROPERTIES table_name[(property_key_literal)]
      ```
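      For example, given the syntax above, usage might look like this (the table and property names below are illustrative, not from the patch):
      
      ``` SQL
      -- List all properties of a table:
      SHOW TBLPROPERTIES my_table;
      
      -- Look up the value of a single property key:
      SHOW TBLPROPERTIES my_table('created.by');
      ```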
      ## How was this patch tested?
      
      Tests added in HiveCommandSuite and DDLCommandSuite.
      
      Author: Dilip Biswal <dbiswal@us.ibm.com>
      
      Closes #12133 from dilipbiswal/dkb_show_tblproperties.
      2715bc68