Skip to content
Snippets Groups Projects
  1. Feb 27, 2017
    • hyukjinkwon's avatar
      [SPARK-15615][SQL][BUILD][FOLLOW-UP] Replace deprecated usage of json(RDD[String]) API · 8a5a5850
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes to replace the deprecated `json(RDD[String])` usage to `json(Dataset[String])`.
      
      This currently produces so many warnings.
      
      ## How was this patch tested?
      
      Fixed tests.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #17071 from HyukjinKwon/SPARK-15615-followup.
      8a5a5850
    • hyukjinkwon's avatar
      [MINOR][BUILD] Fix lint-java breaks in Java · 4ba9c6c4
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes to fix the lint-breaks as below:
      
      ```
      [ERROR] src/test/java/org/apache/spark/network/TransportResponseHandlerSuite.java:[29,8] (imports) UnusedImports: Unused import - org.apache.spark.network.buffer.ManagedBuffer.
      [ERROR] src/main/java/org/apache/spark/unsafe/types/UTF8String.java:[156,10] (modifier) ModifierOrder: 'Nonnull' annotation modifier does not precede non-annotation modifiers.
      [ERROR] src/main/java/org/apache/spark/SparkFirehoseListener.java:[122] (sizes) LineLength: Line is longer than 100 characters (found 105).
      [ERROR] src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java:[164,78] (coding) OneStatementPerLine: Only one statement per line allowed.
      [ERROR] src/test/java/test/org/apache/spark/JavaAPISuite.java:[1157] (sizes) LineLength: Line is longer than 100 characters (found 121).
      [ERROR] src/test/java/org/apache/spark/streaming/JavaMapWithStateSuite.java:[149] (sizes) LineLength: Line is longer than 100 characters (found 113).
      [ERROR] src/test/java/test/org/apache/spark/streaming/Java8APISuite.java:[146] (sizes) LineLength: Line is longer than 100 characters (found 122).
      [ERROR] src/test/java/test/org/apache/spark/streaming/JavaAPISuite.java:[32,8] (imports) UnusedImports: Unused import - org.apache.spark.streaming.Time.
      [ERROR] src/test/java/test/org/apache/spark/streaming/JavaAPISuite.java:[611] (sizes) LineLength: Line is longer than 100 characters (found 101).
      [ERROR] src/test/java/test/org/apache/spark/streaming/JavaAPISuite.java:[1317] (sizes) LineLength: Line is longer than 100 characters (found 102).
      [ERROR] src/test/java/test/org/apache/spark/sql/JavaDatasetAggregatorSuite.java:[91] (sizes) LineLength: Line is longer than 100 characters (found 102).
      [ERROR] src/test/java/test/org/apache/spark/sql/JavaDatasetSuite.java:[113] (sizes) LineLength: Line is longer than 100 characters (found 101).
      [ERROR] src/test/java/test/org/apache/spark/sql/JavaDatasetSuite.java:[164] (sizes) LineLength: Line is longer than 100 characters (found 110).
      [ERROR] src/test/java/test/org/apache/spark/sql/JavaDatasetSuite.java:[212] (sizes) LineLength: Line is longer than 100 characters (found 114).
      [ERROR] src/test/java/org/apache/spark/mllib/tree/JavaDecisionTreeSuite.java:[36] (sizes) LineLength: Line is longer than 100 characters (found 101).
      [ERROR] src/main/java/org/apache/spark/examples/streaming/JavaKinesisWordCountASL.java:[26,8] (imports) UnusedImports: Unused import - com.amazonaws.regions.RegionUtils.
      [ERROR] src/test/java/org/apache/spark/streaming/kinesis/JavaKinesisStreamSuite.java:[20,8] (imports) UnusedImports: Unused import - com.amazonaws.regions.RegionUtils.
      [ERROR] src/test/java/org/apache/spark/streaming/kinesis/JavaKinesisStreamSuite.java:[94] (sizes) LineLength: Line is longer than 100 characters (found 103).
      [ERROR] src/main/java/org/apache/spark/examples/ml/JavaTokenizerExample.java:[30,8] (imports) UnusedImports: Unused import - org.apache.spark.sql.api.java.UDF1.
      [ERROR] src/main/java/org/apache/spark/examples/ml/JavaTokenizerExample.java:[72] (sizes) LineLength: Line is longer than 100 characters (found 104).
      [ERROR] src/main/java/org/apache/spark/examples/mllib/JavaRankingMetricsExample.java:[121] (sizes) LineLength: Line is longer than 100 characters (found 101).
      [ERROR] src/main/java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java:[28,8] (imports) UnusedImports: Unused import - org.apache.spark.api.java.JavaRDD.
      [ERROR] src/main/java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java:[29,8] (imports) UnusedImports: Unused import - org.apache.spark.api.java.JavaSparkContext.
      ```
      
      ## How was this patch tested?
      
      Manually via
      
      ```bash
      ./dev/lint-java
      ```
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #17072 from HyukjinKwon/java-lint.
      4ba9c6c4
  2. Feb 26, 2017
    • Eyal Zituny's avatar
      [SPARK-19594][STRUCTURED STREAMING] StreamingQueryListener fails to handle... · 9f8e3921
      Eyal Zituny authored
      [SPARK-19594][STRUCTURED STREAMING] StreamingQueryListener fails to handle QueryTerminatedEvent if more then one listeners exists
      
      ## What changes were proposed in this pull request?
      
      currently if multiple streaming queries listeners exists, when a QueryTerminatedEvent is triggered, only one of the listeners will be invoked while the rest of the listeners will ignore the event.
      this is caused since the the streaming queries listeners bus holds a set of running queries ids and when a termination event is triggered, after the first listeners is handling the event, the terminated query id is being removed from the set.
      in this PR, the query id will be removed from the set only after all the listeners handles the event
      
      ## How was this patch tested?
      
      a test with multiple listeners has been added to StreamingQueryListenerSuite
      
      Author: Eyal Zituny <eyal.zituny@equalum.io>
      
      Closes #16991 from eyalzit/master.
      9f8e3921
    • Dilip Biswal's avatar
      [SQL] Duplicate test exception in SQLQueryTestSuite due to meta files(.DS_Store) on Mac · 68f2142c
      Dilip Biswal authored
      ## What changes were proposed in this pull request?
      After adding the tests for subquery, we now have multiple level of directories under "sql-tests/inputs".  Some times on Mac while using Finder application it creates the meta data files called ".DS_Store". When these files are present at different levels in directory hierarchy, we get duplicate test exception while running the tests  as we just use the file name as the test case name. In this PR, we use the relative file path from the base directory along with the test file as the test name. Also after this change, we can have the same test file name under different directory like exists/basic.sql , in/basic.sql. Here is the truncated output of the test run after the change.
      
      ```SQL
      info] SQLQueryTestSuite:
      [info] - arithmetic.sql (5 seconds, 235 milliseconds)
      [info] - array.sql (536 milliseconds)
      [info] - blacklist.sql !!! IGNORED !!!
      [info] - cast.sql (550 milliseconds)
      ....
      ....
      ....
      [info] - union.sql (315 milliseconds)
      [info] - subquery/.DS_Store !!! IGNORED !!!
      [info] - subquery/exists-subquery/.DS_Store !!! IGNORED !!!
      [info] - subquery/exists-subquery/exists-aggregate.sql (2 seconds, 451 milliseconds)
      ....
      ....
      [info] - subquery/in-subquery/in-group-by.sql (12 seconds, 264 milliseconds)
      ....
      ....
      [info] - subquery/scalar-subquery/scalar-subquery-predicate.sql (7 seconds, 769 milliseconds)
      [info] - subquery/scalar-subquery/scalar-subquery-select.sql (4 seconds, 119 milliseconds)
      ```
      Since this is a simple change, i haven't created a JIRA for it.
      ## How was this patch tested?
      Manually verified. This is change to test infrastructure
      
      Author: Dilip Biswal <dbiswal@us.ibm.com>
      
      Closes #17060 from dilipbiswal/sqlquerytestsuite.
      68f2142c
    • Wenchen Fan's avatar
      [SPARK-17075][SQL][FOLLOWUP] fix some minor issues and clean up the code · 89608cf2
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      This is a follow-up of https://github.com/apache/spark/pull/16395. It fixes some code style issues, naming issues, some missing cases in pattern match, etc.
      
      ## How was this patch tested?
      
      existing tests.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #17065 from cloud-fan/follow-up.
      89608cf2
    • Joseph K. Bradley's avatar
      [MINOR][ML][DOC] Document default value for GeneralizedLinearRegression.linkPower · 6ab60542
      Joseph K. Bradley authored
      Add Scaladoc for GeneralizedLinearRegression.linkPower default value
      
      Follow-up to https://github.com/apache/spark/pull/16344
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #17069 from jkbradley/tweedie-comment.
      6ab60542
  3. Feb 25, 2017
    • Devaraj K's avatar
      [SPARK-15288][MESOS] Mesos dispatcher should handle gracefully when any thread... · 410392ed
      Devaraj K authored
      [SPARK-15288][MESOS] Mesos dispatcher should handle gracefully when any thread gets UncaughtException
      
      ## What changes were proposed in this pull request?
      
      Adding the default UncaughtExceptionHandler to the MesosClusterDispatcher.
      ## How was this patch tested?
      
      I verified it manually, when any of the dispatcher thread gets uncaught exceptions then the default UncaughtExceptionHandler will handle those exceptions.
      
      Author: Devaraj K <devaraj@apache.org>
      
      Closes #13072 from devaraj-kavali/SPARK-15288.
      410392ed
    • lvdongr's avatar
      [SPARK-19673][SQL] "ThriftServer default app name is changed wrong" · fe07de95
      lvdongr authored
      ## What changes were proposed in this pull request?
      In spark 1.x ,the name of ThriftServer is SparkSQL:localHostName. While the ThriftServer default name is changed to the className of HiveThfift2 , which is not appropriate.
      
      ## How was this patch tested?
      manual tests
      
      Please review http://spark.apache.org/contributing.html before opening a pull request.
      
      Author: lvdongr <lv.dongdong@zte.com.cn>
      
      Closes #17010 from lvdongr/ThriftserverName.
      fe07de95
    • Boaz Mohar's avatar
      [MINOR][DOCS] Fixes two problems in the SQL programing guide page · 061bcfb8
      Boaz Mohar authored
      ## What changes were proposed in this pull request?
      
      Removed duplicated lines in sql python example and found a typo.
      
      ## How was this patch tested?
      
      Searched for other typo's in the page to minimize PR's.
      
      Author: Boaz Mohar <boazmohar@gmail.com>
      
      Closes #17066 from boazmohar/doc-fix.
      061bcfb8
    • Herman van Hovell's avatar
      [SPARK-19650] Commands should not trigger a Spark job · 8f0511ed
      Herman van Hovell authored
      Spark executes SQL commands eagerly. It does this by creating an RDD which contains the command's results. The downside to this is that any action on this RDD triggers a Spark job which is expensive and is unnecessary.
      
      This PR fixes this by avoiding the materialization of an `RDD` for `Command`s; it just materializes the result and puts them in a `LocalRelation`.
      
      Added a regression test to `SQLQuerySuite`.
      
      Author: Herman van Hovell <hvanhovell@databricks.com>
      
      Closes #17027 from hvanhovell/no-job-command.
      8f0511ed
    • Xiao Li's avatar
      [SPARK-19735][SQL] Remove HOLD_DDLTIME from Catalog APIs · 4cb025af
      Xiao Li authored
      ### What changes were proposed in this pull request?
      As explained in Hive JIRA https://issues.apache.org/jira/browse/HIVE-12224, HOLD_DDLTIME was broken as soon as it landed. Hive 2.0 removes HOLD_DDLTIME from the API. In Spark SQL, we always set it to FALSE. Like Hive, we should also remove it from our Catalog APIs.
      
      ### How was this patch tested?
      N/A
      
      Author: Xiao Li <gatorsmile@gmail.com>
      
      Closes #17063 from gatorsmile/removalHoldDDLTime.
      4cb025af
  4. Feb 24, 2017
    • Ramkumar Venkataraman's avatar
      [MINOR][DOCS] Fix few typos in structured streaming doc · 1b9ba258
      Ramkumar Venkataraman authored
      ## What changes were proposed in this pull request?
      
      Minor typo in `even-time`, which is changed to `event-time` and a couple of grammatical errors fix.
      
      ## How was this patch tested?
      
      N/A - since this is a doc fix. I did a jekyll build locally though.
      
      Author: Ramkumar Venkataraman <rvenkataraman@paypal.com>
      
      Closes #17037 from ramkumarvenkat/doc-fix.
      1b9ba258
    • Shubham Chopra's avatar
      [SPARK-15355][CORE] Proactive block replication · fa7c582e
      Shubham Chopra authored
      ## What changes were proposed in this pull request?
      
      We are proposing addition of pro-active block replication in case of executor failures. BlockManagerMasterEndpoint does all the book-keeping to keep a track of all the executors and the blocks they hold. It also keeps a track of which executors are alive through heartbeats. When an executor is removed, all this book-keeping state is updated to reflect the lost executor. This step can be used to identify executors that are still in possession of a copy of the cached data and a message could be sent to them to use the existing "replicate" function to find and place new replicas on other suitable hosts. Blocks replicated this way will let the master know of their existence.
      
      This can happen when an executor is lost, and would that way be pro-active as opposed be being done at query time.
      ## How was this patch tested?
      
      This patch was tested with existing unit tests along with new unit tests added to test the functionality.
      
      Author: Shubham Chopra <schopra31@bloomberg.net>
      
      Closes #14412 from shubhamchopra/ProactiveBlockReplication.
      fa7c582e
    • Jeff Zhang's avatar
      [SPARK-13330][PYSPARK] PYTHONHASHSEED is not propgated to python worker · 330c3e33
      Jeff Zhang authored
      ## What changes were proposed in this pull request?
      self.environment will be propagated to executor. Should set PYTHONHASHSEED as long as the python version is greater than 3.3
      
      ## How was this patch tested?
      Manually tested it.
      
      Author: Jeff Zhang <zjffdu@apache.org>
      
      Closes #11211 from zjffdu/SPARK-13330.
      330c3e33
    • Imran Rashid's avatar
      [SPARK-19597][CORE] test case for task deserialization errors · 5f74148b
      Imran Rashid authored
      Adds a test case that ensures that Executors gracefully handle a task that fails to deserialize, by sending back a reasonable failure message.  This does not change any behavior (the prior behavior was already correct), it just adds a test case to prevent regression.
      
      Author: Imran Rashid <irashid@cloudera.com>
      
      Closes #16930 from squito/executor_task_deserialization.
      5f74148b
    • Kay Ousterhout's avatar
      [SPARK-19560] Improve DAGScheduler tests. · 5cbd3b59
      Kay Ousterhout authored
      This commit improves the tests that check the case when a
      ShuffleMapTask completes successfully on an executor that has
      failed.  This commit improves the commenting around the existing
      test for this, and adds some additional checks to make it more
      clear what went wrong if the tests fail (the fact that these
      tests are hard to understand came up in the context of markhamstra's
      proposed fix for #16620).
      
      This commit also removes a test that I realized tested exactly
      the same functionality.
      
      markhamstra, I verified that the new version of the test still fails (and
      in a more helpful way) for your proposed change for #16620.
      
      Author: Kay Ousterhout <kayousterhout@gmail.com>
      
      Closes #16892 from kayousterhout/SPARK-19560.
      5cbd3b59
    • wangzhenhua's avatar
      [SPARK-17078][SQL] Show stats when explain · 69d0da63
      wangzhenhua authored
      ## What changes were proposed in this pull request?
      
      Currently we can only check the estimated stats in logical plans by debugging. We need to provide an easier and more efficient way for developers/users.
      
      In this pr, we add EXPLAIN COST command to show stats in the optimized logical plan.
      E.g.
      ```
      spark-sql> EXPLAIN COST select count(1) from store_returns;
      
      ...
      == Optimized Logical Plan ==
      Aggregate [count(1) AS count(1)#24L], Statistics(sizeInBytes=16.0 B, rowCount=1, isBroadcastable=false)
      +- Project, Statistics(sizeInBytes=4.3 GB, rowCount=5.76E+8, isBroadcastable=false)
         +- Relation[sr_returned_date_sk#3,sr_return_time_sk#4,sr_item_sk#5,sr_customer_sk#6,sr_cdemo_sk#7,sr_hdemo_sk#8,sr_addr_sk#9,sr_store_sk#10,sr_reason_sk#11,sr_ticket_number#12,sr_return_quantity#13,sr_return_amt#14,sr_return_tax#15,sr_return_amt_inc_tax#16,sr_fee#17,sr_return_ship_cost#18,sr_refunded_cash#19,sr_reversed_charge#20,sr_store_credit#21,sr_net_loss#22] parquet, Statistics(sizeInBytes=28.6 GB, rowCount=5.76E+8, isBroadcastable=false)
      ...
      ```
      
      ## How was this patch tested?
      
      Add test cases.
      
      Author: wangzhenhua <wangzhenhua@huawei.com>
      Author: Zhenhua Wang <wzh_zju@163.com>
      
      Closes #16594 from wzhfy/showStats.
      69d0da63
    • Shuai Lin's avatar
      [SPARK-17075][SQL] Follow up: fix file line ending and improve the tests · 05954f32
      Shuai Lin authored
      ## What changes were proposed in this pull request?
      
      Fixed the line ending of `FilterEstimation.scala` (It's still using `\n\r`). Also improved the tests to cover the cases where the literals are on the left side of a binary operator.
      
      ## How was this patch tested?
      
      Existing unit tests.
      
      Author: Shuai Lin <linshuai2012@gmail.com>
      
      Closes #17051 from lins05/fix-cbo-filter-file-encoding.
      05954f32
    • Tejas Patil's avatar
      [SPARK-17495][SQL] Add more tests for hive hash · 3e40f6c3
      Tejas Patil authored
      ## What changes were proposed in this pull request?
      
      This PR adds tests hive-hash by comparing the outputs generated against Hive 1.2.1. Following datatypes are covered by this PR:
      - null
      - boolean
      - byte
      - short
      - int
      - long
      - float
      - double
      - string
      - array
      - map
      - struct
      
      Datatypes that I have _NOT_ covered but I will work on separately are:
      - Decimal (handled separately in https://github.com/apache/spark/pull/17056)
      - TimestampType
      - DateType
      - CalendarIntervalType
      
      ## How was this patch tested?
      
      NA
      
      Author: Tejas Patil <tejasp@fb.com>
      
      Closes #17049 from tejasapatil/SPARK-17495_remaining_types.
      3e40f6c3
    • jerryshao's avatar
      [SPARK-19038][YARN] Avoid overwriting keytab configuration in yarn-client · a920a436
      jerryshao authored
      ## What changes were proposed in this pull request?
      
      Because yarn#client will reset the `spark.yarn.keytab` configuration to point to the location in distributed file, so if user still uses the old `SparkConf` to create `SparkSession` with Hive enabled, it will read keytab from the path in distributed cached. This is OK for yarn cluster mode, but in yarn client mode where driver is running out of container, it will be failed to fetch the keytab.
      
      So here we should avoid reseting this configuration in the `yarn#client` and only overwriting it for AM, so using `spark.yarn.keytab` could get correct keytab path no matter running in client (keytab in local fs) or cluster (keytab in distributed cache) mode.
      
      ## How was this patch tested?
      
      Verified in security cluster.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #16923 from jerryshao/SPARK-19038.
      a920a436
    • jerryshao's avatar
      [SPARK-19707][CORE] Improve the invalid path check for sc.addJar · b0a8c16f
      jerryshao authored
      ## What changes were proposed in this pull request?
      
      Currently in Spark there're two issues when we add jars with invalid path:
      
      * If the jar path is a empty string {--jar ",dummy.jar"}, then Spark will resolve it to the current directory path and add to classpath / file server, which is unwanted. This is happened in our programatic way to submit Spark application. From my understanding Spark should defensively filter out such empty path.
      * If the jar path is a invalid path (file doesn't exist), `addJar` doesn't check it and will still add to file server, the exception will be delayed until job running. Actually this local path could be checked beforehand, no need to wait until task running. We have similar check in `addFile`, but lacks similar similar mechanism in `addJar`.
      
      ## How was this patch tested?
      
      Add unit test and local manual verification.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #17038 from jerryshao/SPARK-19707.
      b0a8c16f
    • zero323's avatar
      [SPARK-19161][PYTHON][SQL] Improving UDF Docstrings · 4a5e38f5
      zero323 authored
      ## What changes were proposed in this pull request?
      
      Replaces `UserDefinedFunction` object returned from `udf` with a function wrapper providing docstring and arguments information as proposed in [SPARK-19161](https://issues.apache.org/jira/browse/SPARK-19161).
      
      ### Backward incompatible changes:
      
      - `pyspark.sql.functions.udf` will return a `function` instead of `UserDefinedFunction`. To ensure backward compatible public API we use function attributes to mimic  `UserDefinedFunction` API (`func` and `returnType` attributes).  This should have a minimal impact on the user code.
      
        An alternative implementation could use dynamical sub-classing. This would ensure full backward compatibility but is more fragile in practice.
      
      ### Limitations:
      
      Full functionality (retained docstring and argument list) is achieved only in the recent Python version. Legacy Python version will preserve only docstrings, but not argument list. This should be an acceptable trade-off between achieved improvements and overall complexity.
      
      ### Possible impact on other tickets:
      
      This can affect [SPARK-18777](https://issues.apache.org/jira/browse/SPARK-18777).
      
      ## How was this patch tested?
      
      Existing unit tests to ensure backward compatibility, additional tests targeting proposed changes.
      
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #16534 from zero323/SPARK-19161.
      4a5e38f5
    • windpiger's avatar
      [SPARK-19664][SQL] put hive.metastore.warehouse.dir in hadoopconf to overwrite its original value · 8f33731e
      windpiger authored
      ## What changes were proposed in this pull request?
      
      In [SPARK-15959](https://issues.apache.org/jira/browse/SPARK-15959), we bring back the `hive.metastore.warehouse.dir` , while in the logic, when use the value of  `spark.sql.warehouse.dir` to overwrite `hive.metastore.warehouse.dir` , it set it to `sparkContext.conf` which does not overwrite the value is hadoopConf, I think it should put in `sparkContext.hadoopConfiguration` and overwrite the original value of hadoopConf
      
      https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SharedState.scala#L64
      
      ## How was this patch tested?
      N/A
      
      Author: windpiger <songjun@outlook.com>
      
      Closes #16996 from windpiger/hivemetawarehouseConf.
      8f33731e
  5. Feb 23, 2017
    • Ron Hu's avatar
      [SPARK-17075][SQL] implemented filter estimation · d7e43b61
      Ron Hu authored
      ## What changes were proposed in this pull request?
      
      We traverse predicate and evaluate the logical expressions to compute the selectivity of a FILTER operator.
      
      ## How was this patch tested?
      
      We add a new test suite to test various logical operators.
      
      Author: Ron Hu <ron.hu@huawei.com>
      
      Closes #16395 from ron8hu/filterSelectivity.
      d7e43b61
    • Bryan Cutler's avatar
      [SPARK-14772][PYTHON][ML] Fixed Params.copy method to match Scala implementation · 2f69e3f6
      Bryan Cutler authored
      ## What changes were proposed in this pull request?
      Fixed the PySpark Params.copy method to behave like the Scala implementation.  The main issue was that it did not account for the _defaultParamMap and merged it into the explicitly created param map.
      
      ## How was this patch tested?
      Added new unit test to verify the copy method behaves correctly for copying uid, explicitly created params, and default params.
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      
      Closes #16772 from BryanCutler/pyspark-ml-param_copy-Scala_sync-SPARK-14772.
      2f69e3f6
    • uncleGen's avatar
      [SPARK-16122][DOCS] application environment rest api · d0276245
      uncleGen authored
      ## What changes were proposed in this pull request?
      
      follow up pr of #16949.
      
      ## How was this patch tested?
      
      jenkins
      
      Author: uncleGen <hustyugm@gmail.com>
      
      Closes #17033 from uncleGen/doc-restapi-environment.
      d0276245
    • Carson Wang's avatar
      [SPARK-19674][SQL] Ignore driver accumulator updates don't belong to the... · eff7b408
      Carson Wang authored
      [SPARK-19674][SQL] Ignore driver accumulator updates don't belong to the execution when merging all accumulator updates
      
      ## What changes were proposed in this pull request?
      In SQLListener.getExecutionMetrics, driver accumulator updates don't belong to the execution should be ignored when merging all accumulator updates to prevent NoSuchElementException.
      
      ## How was this patch tested?
      Updated unit test.
      
      Author: Carson Wang <carson.wang@intel.com>
      
      Closes #17009 from carsonwang/FixSQLMetrics.
      eff7b408
    • Kay Ousterhout's avatar
      [SPARK-19684][DOCS] Remove developer info from docs. · f87a6a59
      Kay Ousterhout authored
      This commit moves developer-specific information from the release-
      specific documentation in this repo to the developer tools page on
      the main Spark website. This commit relies on this PR on the
      Spark website: https://github.com/apache/spark-website/pull/33.
      
      srowen
      
      Author: Kay Ousterhout <kayousterhout@gmail.com>
      
      Closes #17018 from kayousterhout/SPARK-19684.
      f87a6a59
    • Wenchen Fan's avatar
      [SPARK-19706][PYSPARK] add Column.contains in pyspark · 4fa4cf1d
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      to be consistent with the scala API, we should also add `contains` to `Column` in pyspark.
      
      ## How was this patch tested?
      
      updated unit test
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #17036 from cloud-fan/pyspark.
      4fa4cf1d
    • Takeshi Yamamuro's avatar
      [SPARK-18699][SQL] Put malformed tokens into a new field when parsing CSV data · 09ed6e77
      Takeshi Yamamuro authored
      ## What changes were proposed in this pull request?
      This pr added a logic to put malformed tokens into a new field when parsing CSV data  in case of permissive modes. In the current master, if the CSV parser hits these malformed ones, it throws an exception below (and then a job fails);
      ```
      Caused by: java.lang.IllegalArgumentException
      	at java.sql.Date.valueOf(Date.java:143)
      	at org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:137)
      	at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply$mcJ$sp(CSVInferSchema.scala:272)
      	at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply(CSVInferSchema.scala:272)
      	at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply(CSVInferSchema.scala:272)
      	at scala.util.Try.getOrElse(Try.scala:79)
      	at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:269)
      	at
      ```
      In case that users load large CSV-formatted data, the job failure makes users get some confused. So, this fix set NULL for original columns and put malformed tokens in a new field.
      
      ## How was this patch tested?
      Added tests in `CSVSuite`.
      
      Author: Takeshi Yamamuro <yamamuro@apache.org>
      
      Closes #16928 from maropu/SPARK-18699-2.
      09ed6e77
    • Shixiong Zhu's avatar
      [SPARK-19497][SS] Implement streaming deduplication · 9bf4e2ba
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      This PR adds a special streaming deduplication operator to support `dropDuplicates` with `aggregation` and watermark. It reuses the `dropDuplicates` API but creates new logical plan `Deduplication` and new physical plan `DeduplicationExec`.
      
      The following cases are supported:
      
      - one or multiple `dropDuplicates()` without aggregation (with or without watermark)
      - `dropDuplicates` before aggregation
      
      Not supported cases:
      
      - `dropDuplicates` after aggregation
      
      Breaking changes:
      - `dropDuplicates` without aggregation doesn't work with `complete` or `update` mode.
      
      ## How was this patch tested?
      
      The new unit tests.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #16970 from zsxwing/dedup.
      9bf4e2ba
    • actuaryzhang's avatar
      [SPARK-19682][SPARKR] Issue warning (or error) when subset method "[[" takes vector index · 7bf09433
      actuaryzhang authored
      ## What changes were proposed in this pull request?
      The `[[` method is supposed to take a single index and return a column. This is different from base R which takes a vector index.  We should check for this and issue warning or error when vector index is supplied (which is very likely given the behavior in base R).
      
      Currently I'm issuing a warning message and just take the first element of the vector index. We could change this to an error it that's better.
      
      ## How was this patch tested?
      new tests
      
      Author: actuaryzhang <actuaryzhang10@gmail.com>
      
      Closes #17017 from actuaryzhang/sparkRSubsetter.
      7bf09433
    • Herman van Hovell's avatar
      [SPARK-19459] Support for nested char/varchar fields in ORC · 78eae7e6
      Herman van Hovell authored
      ## What changes were proposed in this pull request?
      This PR is a small follow-up on https://github.com/apache/spark/pull/16804. This PR also adds support for nested char/varchar fields in orc.
      
      ## How was this patch tested?
      I have added a regression test to the OrcSourceSuite.
      
      Author: Herman van Hovell <hvanhovell@databricks.com>
      
      Closes #17030 from hvanhovell/SPARK-19459-follow-up.
      78eae7e6
    • Takeshi Yamamuro's avatar
      [SPARK-19691][SQL] Fix ClassCastException when calculating percentile of decimal column · 93aa4271
      Takeshi Yamamuro authored
      ## What changes were proposed in this pull request?
      This pr fixed a class-cast exception below;
      ```
      scala> spark.range(10).selectExpr("cast (id as decimal) as x").selectExpr("percentile(x, 0.5)").collect()
       java.lang.ClassCastException: org.apache.spark.sql.types.Decimal cannot be cast to java.lang.Number
      	at org.apache.spark.sql.catalyst.expressions.aggregate.Percentile.update(Percentile.scala:141)
      	at org.apache.spark.sql.catalyst.expressions.aggregate.Percentile.update(Percentile.scala:58)
      	at org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.update(interfaces.scala:514)
      	at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1$$anonfun$applyOrElse$1.apply(AggregationIterator.scala:171)
      	at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1$$anonfun$applyOrElse$1.apply(AggregationIterator.scala:171)
      	at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateProcessRow$1.apply(AggregationIterator.scala:187)
      	at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateProcessRow$1.apply(AggregationIterator.scala:181)
      	at org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.processInputs(ObjectAggregationIterator.scala:151)
      	at org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.<init>(ObjectAggregationIterator.scala:78)
      	at org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:109)
      	at
      ```
      This fix simply converts catalyst values (i.e., `Decimal`) into scala ones by using `CatalystTypeConverters`.
      
      ## How was this patch tested?
      Added a test in `DataFrameSuite`.
      
      Author: Takeshi Yamamuro <yamamuro@apache.org>
      
      Closes #17028 from maropu/SPARK-19691.
      93aa4271
  6. Feb 22, 2017
    • Takeshi Yamamuro's avatar
      [SPARK-19695][SQL] Throw an exception if a `columnNameOfCorruptRecord` field... · 769aa0f1
      Takeshi Yamamuro authored
      [SPARK-19695][SQL] Throw an exception if a `columnNameOfCorruptRecord` field violates requirements in json formats
      
      ## What changes were proposed in this pull request?
      This pr comes from #16928 and fixed a json behaviour along with the CSV one.
      
      ## How was this patch tested?
      Added tests in `JsonSuite`.
      
      Author: Takeshi Yamamuro <yamamuro@apache.org>
      
      Closes #17023 from maropu/SPARK-19695.
      769aa0f1
    • uncleGen's avatar
      [SPARK-16122][CORE] Add rest api for job environment · 66c4b79a
      uncleGen authored
      ## What changes were proposed in this pull request?
      
      add rest api for job environment.
      
      ## How was this patch tested?
      
      existing ut.
      
      Author: uncleGen <hustyugm@gmail.com>
      
      Closes #16949 from uncleGen/SPARK-16122.
      66c4b79a
    • pj.fanning's avatar
      [SPARK-15615][SQL] Add an API to load DataFrame from Dataset[String] storing JSON · d3147502
      pj.fanning authored
      ## What changes were proposed in this pull request?
      
      SPARK-15615 proposes replacing the sqlContext.read.json(rdd) with a dataset equivalent.
      SPARK-15463 adds a CSV API for reading from Dataset[String] so this keeps the API consistent.
      I am deprecating the existing RDD based APIs.
      
      ## How was this patch tested?
      
      There are existing tests. I left most tests to use the existing APIs as they delegate to the new json API.
      
      Please review http://spark.apache.org/contributing.html before opening a pull request.
      
      Author: pj.fanning <pj.fanning@workday.com>
      Author: PJ Fanning <pjfanning@users.noreply.github.com>
      
      Closes #16895 from pjfanning/SPARK-15615.
      d3147502
    • Xiao Li's avatar
      [SPARK-19658][SQL] Set NumPartitions of RepartitionByExpression In Parser · dc005ed5
      Xiao Li authored
      ### What changes were proposed in this pull request?
      
      Currently, if `NumPartitions` is not set in RepartitionByExpression, we will set it using `spark.sql.shuffle.partitions` during Planner. However, this is not following the general resolution process. This PR is to set it in `Parser` and then `Optimizer` can use the value for plan optimization.
      
      ### How was this patch tested?
      
      Added a test case.
      
      Author: Xiao Li <gatorsmile@gmail.com>
      
      Closes #16988 from gatorsmile/resolveRepartition.
      dc005ed5
    • Marcelo Vanzin's avatar
      [SPARK-19554][UI,YARN] Allow SHS URL to be used for tracking in YARN RM. · 4661d30b
      Marcelo Vanzin authored
      Allow an application to use the History Server URL as the tracking
      URL in the YARN RM, so there's still a link to the web UI somewhere
      in YARN even if the driver's UI is disabled. This is useful, for
      example, if an admin wants to disable the driver UI by default for
      applications, since it's harder to secure it (since it involves non
      trivial ssl certificate and auth management that admins may not want
      to expose to user apps).
      
      This needs to be opt-in, because of the way the YARN proxy works, so
      a new configuration was added to enable the option.
      
      The YARN RM will proxy requests to live AMs instead of redirecting
      the client, so pages in the SHS UI will not render correctly since
      they'll reference invalid paths in the RM UI. The proxy base support
      in the SHS cannot be used since that would prevent direct access to
      the SHS.
      
      So, to solve this problem, for the feature to work end-to-end, a new
      YARN-specific filter was added that detects whether the requests come
      from the proxy and redirects the client appropriatly. The SHS admin has
      to add this filter manually if they want the feature to work.
      
      Tested with new unit test, and by running with the documented configuration
      set in a test cluster. Also verified the driver UI is used when it's
      enabled.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #16946 from vanzin/SPARK-19554.
      4661d30b
    • hyukjinkwon's avatar
      [SPARK-19666][SQL] Skip a property without getter in Java schema inference and... · 37112fcf
      hyukjinkwon authored
      [SPARK-19666][SQL] Skip a property without getter in Java schema inference and allow empty bean in encoder creation
      
      ## What changes were proposed in this pull request?
      
      This PR proposes to fix two.
      
      **Skip a property without a getter in beans**
      
      Currently, if we use a JavaBean without the getter as below:
      
      ```java
      public static class BeanWithoutGetter implements Serializable {
        private String a;
      
        public void setA(String a) {
          this.a = a;
        }
      }
      
      BeanWithoutGetter bean = new BeanWithoutGetter();
      List<BeanWithoutGetter> data = Arrays.asList(bean);
      spark.createDataFrame(data, BeanWithoutGetter.class).show();
      ```
      
      - Before
      
      It throws an exception as below:
      
      ```
      java.lang.NullPointerException
      	at org.spark_project.guava.reflect.TypeToken.method(TypeToken.java:465)
      	at org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:126)
      	at org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:125)
      ```
      
      - After
      
      ```
      ++
      ||
      ++
      ||
      ++
      ```
      
      **Supports empty bean in encoder creation**
      
      ```java
      public static class EmptyBean implements Serializable {}
      
      EmptyBean bean = new EmptyBean();
      List<EmptyBean> data = Arrays.asList(bean);
      spark.createDataset(data, Encoders.bean(EmptyBean.class)).show();
      ```
      
      - Before
      
      throws an exception as below:
      
      ```
      java.lang.UnsupportedOperationException: Cannot infer type for class EmptyBean because it is not bean-compliant
      	at org.apache.spark.sql.catalyst.JavaTypeInference$.org$apache$spark$sql$catalyst$JavaTypeInference$$serializerFor(JavaTypeInference.scala:436)
      	at org.apache.spark.sql.catalyst.JavaTypeInference$.serializerFor(JavaTypeInference.scala:341)
      ```
      
      - After
      
      ```
      ++
      ||
      ++
      ||
      ++
      ```
      
      ## How was this patch tested?
      
      Unit test in `JavaDataFrameSuite`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #17013 from HyukjinKwon/SPARK-19666.
      37112fcf
Loading