  1. Jun 06, 2016
    • Subroto Sanyal's avatar
      [SPARK-15652][LAUNCHER] Added a new State (LOST) for the listeners of SparkLauncher · c409e23a
      Subroto Sanyal authored
      ## What changes were proposed in this pull request?
      This situation can happen when the LauncherConnection hits an exception while reading from the socket and terminates silently, without notifying the client/listener, leaving it to think the job is still in its previous state.
      The fix forcibly sends a notification to the client that the job finished with an unknown status, and lets the client handle it accordingly.
      
      ## How was this patch tested?
      Added a unit test.
      
      Author: Subroto Sanyal <ssanyal@datameer.com>
      
      Closes #13497 from subrotosanyal/SPARK-15652-handle-spark-submit-jvm-crash.
      c409e23a
    • Imran Rashid's avatar
      [SPARK-15783][CORE] still some flakiness in these blacklist tests so ignore for now · 36d3dfa5
      Imran Rashid authored
      ## What changes were proposed in this pull request?
      
      There is still some flakiness in BlacklistIntegrationSuite, so turning it off for the moment to avoid breaking more builds -- will turn it back with more fixes.
      
      ## How was this patch tested?
      
      jenkins.
      
      Author: Imran Rashid <irashid@cloudera.com>
      
      Closes #13528 from squito/ignore_blacklist.
      36d3dfa5
    • Josh Rosen's avatar
      [SPARK-15764][SQL] Replace N^2 loop in BindReferences · 0b8d6949
      Josh Rosen authored
      BindReferences contains an O(n^2) loop which causes performance issues when operating over large schemas: to determine the ordinal of an attribute reference, we perform a linear scan over the `input` array. Because `input` can sometimes be a `List`, the call to `input(ordinal).nullable` can also be O(n).
      
      Instead of performing a linear scan, we can convert the input into an array and build a hash map to map from expression ids to ordinals. The greater up-front cost of the map construction is offset by the fact that an expression can contain multiple attribute references, so the cost of the map construction is amortized across a number of lookups.
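      The map-based lookup can be sketched in a few lines (illustrative Python with hypothetical names, not the actual BindReferences code):

```python
def bind_references(input_attrs, references):
    # Build the attribute -> ordinal map once, in O(n)...
    ordinal_of = {attr: i for i, attr in enumerate(input_attrs)}
    # ...so each of the (possibly many) references resolves in O(1),
    # instead of a linear scan per reference.
    return [ordinal_of[ref] for ref in references]
```

      The up-front cost of building `ordinal_of` is amortized across all the attribute-reference lookups.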
      
      Perf. benchmarks to follow. /cc ericl
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #13505 from JoshRosen/bind-references-improvement.
      0b8d6949
    • Joseph K. Bradley's avatar
      [SPARK-15721][ML] Make DefaultParamsReadable, DefaultParamsWritable public · 4c74ee8d
      Joseph K. Bradley authored
      ## What changes were proposed in this pull request?
      
      Made DefaultParamsReadable, DefaultParamsWritable public.  Also added relevant doc and annotations.  Added UnaryTransformerExample to demonstrate use of UnaryTransformer and DefaultParamsReadable,Writable.
      
      ## How was this patch tested?
      
      Wrote example making use of the now-public APIs.  Compiled and ran locally
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #13461 from jkbradley/defaultparamswritable.
      4c74ee8d
    • Dhruve Ashar's avatar
      [SPARK-14279][BUILD] Pick the spark version from pom · fa4bc8ea
      Dhruve Ashar authored
      ## What changes were proposed in this pull request?
      Change the way spark picks up version information. Also embed the build information to better identify the spark version running.
      
      More context can be found here : https://github.com/apache/spark/pull/12152
      
      ## How was this patch tested?
      Ran the mvn and sbt builds to verify the version information was being displayed correctly on executing <code>spark-submit --version</code>
      
      ![image](https://cloud.githubusercontent.com/assets/7732317/15197251/f7c673a2-1795-11e6-8b2f-88f2a70cf1c1.png)
      
      Author: Dhruve Ashar <dhruveashar@gmail.com>
      
      Closes #13061 from dhruve/impr/SPARK-14279.
      fa4bc8ea
    • Zheng RuiFeng's avatar
      [SPARK-14900][ML][PYSPARK] Add accuracy and deprecate precison,recall,f1 · 00ad4f05
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
      1, add accuracy for MulticlassMetrics
      2, deprecate overall precision,recall,f1 and recommend accuracy usage
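      As a hedged sketch of what the new metric computes (not the MulticlassMetrics implementation itself), overall accuracy is simply the fraction of predictions that match their labels:

```python
def accuracy(prediction_label_pairs):
    """Fraction of (prediction, label) pairs where prediction == label."""
    pairs = list(prediction_label_pairs)
    return sum(p == l for p, l in pairs) / len(pairs)
```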
      
      ## How was this patch tested?
      manual tests in pyspark shell
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #13511 from zhengruifeng/deprecate_py_precisonrecall.
      00ad4f05
    • Yanbo Liang's avatar
      [SPARK-15771][ML][EXAMPLES] Use 'accuracy' rather than 'precision' in many ML examples · a9525282
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      Since [SPARK-15617](https://issues.apache.org/jira/browse/SPARK-15617) deprecated ```precision``` in ```MulticlassClassificationEvaluator```, many ML examples are broken.
      ```python
      pyspark.sql.utils.IllegalArgumentException: u'MulticlassClassificationEvaluator_4c3bb1d73d8cc0cedae6 parameter metricName given invalid value precision.'
      ```
      We should use ```accuracy``` to replace ```precision``` in these examples.
      
      ## How was this patch tested?
      Offline tests.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #13519 from yanboliang/spark-15771.
      a9525282
    • Zheng RuiFeng's avatar
      [MINOR] Fix Typos 'an -> a' · fd8af397
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
      
      `an -> a`
      
      Use cmds like `find . -name '*.R' | xargs -i sh -c "grep -in ' an [^aeiou]' {} && echo {}"` to generate candidates, and review them one by one.
      
      ## How was this patch tested?
      manual tests
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #13515 from zhengruifeng/an_a.
      fd8af397
    • Reynold Xin's avatar
      32f2f95d
    • Takeshi YAMAMURO's avatar
      [SPARK-15585][SQL] Fix NULL handling along with a spark-csv behaviour · b7e8d1cb
      Takeshi YAMAMURO authored
      ## What changes were proposed in this pull request?
      This pr fixes the behaviour of `format("csv").option("quote", null)` to match that of spark-csv.
      Also, it explicitly sets default values for CSV options in Python.
      
      ## How was this patch tested?
      Added tests in CSVSuite.
      
      Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>
      
      Closes #13372 from maropu/SPARK-15585.
      b7e8d1cb
  2. Jun 05, 2016
    • Hiroshi Inoue's avatar
      [SPARK-15704][SQL] add a test case in DatasetAggregatorSuite for regression testing · 79268aa4
      Hiroshi Inoue authored
      ## What changes were proposed in this pull request?
      
      This change fixes a crash in TungstenAggregate while executing "Dataset complex Aggregator" test case due to IndexOutOfBoundsException.
      
      jira entry for detail: https://issues.apache.org/jira/browse/SPARK-15704
      
      ## How was this patch tested?
      Using existing unit tests (including DatasetBenchmark)
      
      Author: Hiroshi Inoue <inouehrs@jp.ibm.com>
      
      Closes #13446 from inouehrs/fix_aggregate.
      79268aa4
    • Josh Rosen's avatar
      [SPARK-15748][SQL] Replace inefficient foldLeft() call with flatMap() in PartitionStatistics · 26c1089c
      Josh Rosen authored
      `PartitionStatistics` uses `foldLeft` and list concatenation (`++`) to flatten an iterator of lists, but this is extremely inefficient compared to simply doing `flatMap`/`flatten` because it performs many unnecessary object allocations. Simply replacing this `foldLeft` by a `flatMap` results in decent performance gains when constructing PartitionStatistics instances for tables with many columns.
      
      This patch fixes this and also makes two similar changes in MLlib and streaming to try to fix all known occurrences of this pattern.
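      The inefficiency can be sketched generically (Python stand-in; the real code operates on Scala collections): repeated concatenation copies the accumulator on every step, while a single flatten visits each element once:

```python
from itertools import chain

lists = [[1, 2], [3], [4, 5]]

# foldLeft-with-++ style: each step copies the whole accumulator, O(n^2) overall.
slow = []
for xs in lists:
    slow = slow + xs  # allocates a fresh list every iteration

# flatMap/flatten style: each element is visited exactly once, O(n).
fast = list(chain.from_iterable(lists))
```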
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #13491 from JoshRosen/foldleft-to-flatmap.
      26c1089c
    • Wenchen Fan's avatar
      [SPARK-15657][SQL] RowEncoder should validate the data type of input object · 30c4774f
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      This PR improves the error handling of `RowEncoder`. When we create a `RowEncoder` with a given schema, we should validate the data type of the input object, e.g. we should throw an exception when a field is a boolean but is declared as a string column.
      
      This PR also removes the support for using `Product` as a valid external type of struct type.  This support was added at https://github.com/apache/spark/pull/9712, but is incomplete, e.g. nested products and products in arrays both do not work.  However, we never officially supported this feature and I think it's ok to ban it.
      
      ## How was this patch tested?
      
      new tests in `RowEncoderSuite`.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #13401 from cloud-fan/bug.
      30c4774f
    • Kai Jiang's avatar
      [MINOR][R][DOC] Fix R documentation generation instruction. · 8a911051
      Kai Jiang authored
      ## What changes were proposed in this pull request?
      changes in R/README.md
      
      - Make the steps for generating the SparkR documentation clearer.
      - link R/DOCUMENTATION.md from R/README.md
      - turn on some code syntax highlight in R/README.md
      
      ## How was this patch tested?
      local test
      
      Author: Kai Jiang <jiangkai@gmail.com>
      
      Closes #13488 from vectorijk/R-Readme.
      8a911051
    • Zheng RuiFeng's avatar
      [SPARK-15770][ML] Annotation audit for Experimental and DeveloperApi · 372fa61f
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
      1, remove comments `:: Experimental ::` for non-experimental APIs
      2, add comments `:: Experimental ::` for experimental APIs
      3, add comments `:: DeveloperApi ::` for DeveloperApi APIs
      
      ## How was this patch tested?
      manual tests
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #13514 from zhengruifeng/del_experimental.
      372fa61f
    • Brett Randall's avatar
      [SPARK-15723] Fixed local-timezone-brittle test where short-timezone form "EST" is … · 4e767d0f
      Brett Randall authored
      ## What changes were proposed in this pull request?
      
      Stop using the abbreviated and ambiguous timezone "EST" in a test, since it depends on the machine's default timezone and fails when run in other timezones.  Fixed [SPARK-15723](https://issues.apache.org/jira/browse/SPARK-15723).
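      For illustration of why the abbreviation is fragile (assumed fix direction, not the actual patch): an explicit UTC offset or a full IANA zone ID is unambiguous, whereas "EST" can mean US Eastern or Australian Eastern Standard Time depending on the machine default:

```python
from datetime import datetime, timedelta, timezone

# A fixed UTC-05:00 offset pins the meaning regardless of the machine's
# default timezone, unlike the ambiguous abbreviation "EST".
us_eastern_standard = timezone(timedelta(hours=-5), "GMT-05:00")
dt = datetime(2016, 6, 5, 12, 0, tzinfo=us_eastern_standard)
```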
      
      ## How was this patch tested?
      
      Note that to reproduce this problem in any locale/timezone, you can modify the scalatest-maven-plugin argLine to add a timezone:
      
          <argLine>-ea -Xmx3g -XX:MaxPermSize=${MaxPermGen} -XX:ReservedCodeCacheSize=${CodeCacheSize} -Duser.timezone="Australia/Sydney"</argLine>
      
      and run
      
          $ mvn test -DwildcardSuites=org.apache.spark.status.api.v1.SimpleDateParamSuite -Dtest=none

      Equally, this will fix it in an affected timezone:
      
          <argLine>-ea -Xmx3g -XX:MaxPermSize=${MaxPermGen} -XX:ReservedCodeCacheSize=${CodeCacheSize} -Duser.timezone="America/New_York"</argLine>
      
      To test the fix, apply the above change to `pom.xml` to set test TZ to `Australia/Sydney`, and confirm the test now passes.
      
      Author: Brett Randall <javabrett@gmail.com>
      
      Closes #13462 from javabrett/SPARK-15723-SimpleDateParamSuite.
      4e767d0f
  3. Jun 04, 2016
    • Weiqing Yang's avatar
      [SPARK-15707][SQL] Make Code Neat - Use map instead of if check. · 0f307db5
      Weiqing Yang authored
      ## What changes were proposed in this pull request?
      In the `forType` function of object `RandomDataGenerator`, the following code:
      ```scala
      if (maybeSqlTypeGenerator.isDefined) {
        ....
        Some(generator)
      } else {
        None
      }
      ```
      will be changed to use `maybeSqlTypeGenerator.map` instead.
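      The same refactoring pattern outside Scala, as a minimal sketch (`opt_map` is a hypothetical stand-in for `Option.map`):

```python
def opt_map(opt, f):
    """Mimic Scala's Option.map: apply f only when a value is present."""
    return None if opt is None else f(opt)

# The if/else version and the map version produce the same results:
doubled = opt_map(3, lambda x: x * 2)     # Some(6) analogue
nothing = opt_map(None, lambda x: x * 2)  # None analogue
```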
      
      ## How was this patch tested?
      All of the current unit tests passed.
      
      Author: Weiqing Yang <yangweiqing001@gmail.com>
      
      Closes #13448 from Sherry302/master.
      0f307db5
    • Josh Rosen's avatar
      [SPARK-15762][SQL] Cache Metadata & StructType hashCodes; use singleton Metadata.empty · 091f81e1
      Josh Rosen authored
      We should cache `Metadata.hashCode` and use a singleton for `Metadata.empty` because calculating metadata hashCodes appears to be a bottleneck for certain workloads.
      
      We should also cache `StructType.hashCode`.
      
      In an optimizer stress-test benchmark run by ericl, these `hashCode` calls accounted for roughly 40% of the total CPU time and this bottleneck was completely eliminated by the caching added by this patch.
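      Both techniques, lazy hash caching and a shared empty singleton, can be sketched as follows (illustrative class, not Spark's actual Metadata):

```python
class Metadata:
    _empty_instance = None

    def __init__(self, entries=None):
        self._entries = tuple(sorted((entries or {}).items()))
        self._cached_hash = None  # computed on first __hash__ call, then reused

    def __hash__(self):
        if self._cached_hash is None:
            self._cached_hash = hash(self._entries)
        return self._cached_hash

    @classmethod
    def empty(cls):
        # Singleton: all callers share one instance (and its cached hash).
        if cls._empty_instance is None:
            cls._empty_instance = cls()
        return cls._empty_instance
```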
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #13504 from JoshRosen/metadata-fix.
      091f81e1
    • Sean Owen's avatar
      [MINOR][BUILD] Add modernizr MIT license; specify "2014 and onwards" in license copyright · 681387b2
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Per conversation on dev list, add missing modernizr license.
      Specify "2014 and onwards" in copyright statement.
      
      ## How was this patch tested?
      
      (none required)
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #13510 from srowen/ModernizrLicense.
      681387b2
    • Ruifeng Zheng's avatar
      [SPARK-15617][ML][DOC] Clarify that fMeasure in MulticlassMetrics is "micro" f1_score · 2099e05f
      Ruifeng Zheng authored
      ## What changes were proposed in this pull request?
      1, delete precision, recall in `ml.MulticlassClassificationEvaluator`
      2, update the user guide for `mllib.weightedFMeasure`
      
      ## How was this patch tested?
      local build
      
      Author: Ruifeng Zheng <ruifengz@foxmail.com>
      
      Closes #13390 from zhengruifeng/clarify_f1.
      2099e05f
    • Lianhui Wang's avatar
      [SPARK-15756][SQL] Support command 'create table stored as orcfile/parquetfile/avrofile' · 2ca563cc
      Lianhui Wang authored
      ## What changes were proposed in this pull request?
      Spark SQL currently supports 'create table src stored as orc/parquet/avro' for orc/parquet/avro tables, but Hive supports both 'stored as orc/parquet/avro' and 'stored as orcfile/parquetfile/avrofile'.
      So this PR adds support for the keywords 'orcfile/parquetfile/avrofile' in Spark SQL.
      
      ## How was this patch tested?
      add unit tests
      
      Author: Lianhui Wang <lianhuiwang09@gmail.com>
      
      Closes #13500 from lianhuiwang/SPARK-15756.
      2ca563cc
  4. Jun 03, 2016
    • Subroto Sanyal's avatar
      [SPARK-15754][YARN] Not letting the credentials containing hdfs delegation... · 61d729ab
      Subroto Sanyal authored
      [SPARK-15754][YARN] Not letting the credentials containing hdfs delegation tokens to be added in current user credential.
      
      ## What changes were proposed in this pull request?
      The credentials are not added to the credentials of UserGroupInformation.getCurrentUser(). Further, if the client can log in using a keytab, the updateDelegationToken thread is not started on the client.
      
      ## How was this patch tested?
      ran dev/run-tests
      
      Author: Subroto Sanyal <ssanyal@datameer.com>
      
      Closes #13499 from subrotosanyal/SPARK-15754-save-ugi-from-changing.
      61d729ab
    • Davies Liu's avatar
      [SPARK-15391] [SQL] manage the temporary memory of timsort · 3074f575
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      Currently, the memory for the temporary buffer used by TimSort is always allocated on-heap without bookkeeping, which could cause OOM in both on-heap and off-heap modes.
      
      This PR tries to manage that memory by preallocating it together with the pointer array, the same as RadixSort does. This works in both on-heap and off-heap modes.
      
      This PR also changes the loadFactor of BytesToBytesMap to 0.5 (it was 0.70); this enables us to use radix sort and also makes sure we have enough memory for timsort.
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #13318 from davies/fix_timsort.
      3074f575
    • Holden Karau's avatar
      [SPARK-15168][PYSPARK][ML] Add missing params to MultilayerPerceptronClassifier · 67cc89ff
      Holden Karau authored
      ## What changes were proposed in this pull request?
      
      MultilayerPerceptronClassifier is missing step size, solver, and weights. Add these params. Also clarify the scaladoc a bit while we are updating these params.
      
      Eventually we should follow up and unify the HasSolver params (filed https://issues.apache.org/jira/browse/SPARK-15169 )
      
      ## How was this patch tested?
      
      Doc tests
      
      Author: Holden Karau <holden@us.ibm.com>
      
      Closes #12943 from holdenk/SPARK-15168-add-missing-params-to-MultilayerPerceptronClassifier.
      67cc89ff
    • Andrew Or's avatar
      [SPARK-15722][SQL] Disallow specifying schema in CTAS statement · b1cc7da3
      Andrew Or authored
      ## What changes were proposed in this pull request?
      
      As of this patch, the following throws an exception because the schemas may not match:
      ```
      CREATE TABLE students (age INT, name STRING) AS SELECT * FROM boxes
      ```
      but this is OK:
      ```
      CREATE TABLE students AS SELECT * FROM boxes
      ```
      
      ## How was this patch tested?
      
      SQLQuerySuite, HiveDDLCommandSuite
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #13490 from andrewor14/ctas-no-column.
      b1cc7da3
    • Wenchen Fan's avatar
      [SPARK-15140][SQL] make the semantics of null input object for encoder clear · 11c83f83
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      For an input object of non-flat type, we can't encode it to a row if it's null, as Spark SQL doesn't allow a row to be null; only its columns can be null.
      
      This PR explicitly adds this constraint and throws an exception if users break it.
      
      ## How was this patch tested?
      
      several new tests
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #13469 from cloud-fan/null-object.
      11c83f83
    • Xin Wu's avatar
      [SPARK-15681][CORE] allow lowercase or mixed case log level string when calling sc.setLogLevel · 28ad0f7b
      Xin Wu authored
      ## What changes were proposed in this pull request?
      Currently the `SparkContext` API `setLogLevel(level: String)` cannot handle lower-case or mixed-case input strings, but `org.apache.log4j.Level.toLevel` can take lowercase or mixed case.
      
      This PR is to allow case-insensitive user input for the log level.
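      The fix direction can be sketched with a hypothetical helper (not the actual SparkContext code): normalize case before validating:

```python
VALID_LEVELS = {"ALL", "DEBUG", "ERROR", "FATAL", "INFO", "OFF", "TRACE", "WARN"}

def normalize_log_level(level):
    # Accept "info", "Info", "INFO", ... by upper-casing before validation.
    upper = level.upper()
    if upper not in VALID_LEVELS:
        raise ValueError("invalid log level: " + level)
    return upper
```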
      
      ## How was this patch tested?
      A unit testcase is added.
      
      Author: Xin Wu <xinwu@us.ibm.com>
      
      Closes #13422 from xwu0226/reset_loglevel.
      28ad0f7b
    • Wenchen Fan's avatar
      [SPARK-15547][SQL] nested case class in encoder can have different number of... · 61b80d55
      Wenchen Fan authored
      [SPARK-15547][SQL] nested case class in encoder can have different number of fields from the real schema
      
      ## What changes were proposed in this pull request?
      
      There are 2 kinds of `GetStructField`:
      
      1. resolved from `UnresolvedExtractValue`, and it will have a `name` property.
      2. created when we build deserializer expression for nested tuple, no `name` property.
      
      When we want to validate the ordinals of nested tuple, we should only catch `GetStructField` without the name property.
      
      ## How was this patch tested?
      
      new test in `EncoderResolutionSuite`
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #13474 from cloud-fan/ordinal-check.
      61b80d55
    • gatorsmile's avatar
      [SPARK-15286][SQL] Make the output readable for EXPLAIN CREATE TABLE and DESC EXTENDED · eb10b481
      gatorsmile authored
      #### What changes were proposed in this pull request?
      Before this PR, the output of EXPLAIN of following SQL is like
      
      ```SQL
      CREATE EXTERNAL TABLE extTable_with_partitions (key INT, value STRING)
      PARTITIONED BY (ds STRING, hr STRING)
      LOCATION '/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-b39a6185-8981-403b-a4aa-36fb2f4ca8a9'
      ```
      ``ExecutedCommand CreateTableCommand CatalogTable(`extTable_with_partitions`,CatalogTableType(EXTERNAL),CatalogStorageFormat(Some(/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-dd234718-e85d-4c5a-8353-8f1834ac0323),Some(org.apache.hadoop.mapred.TextInputFormat),Some(org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat),None,false,Map()),List(CatalogColumn(key,int,true,None), CatalogColumn(value,string,true,None), CatalogColumn(ds,string,true,None), CatalogColumn(hr,string,true,None)),List(ds, hr),List(),List(),-1,,1463026413544,-1,Map(),None,None,None), false``
      
      After this PR, the output is like
      
      ```
      ExecutedCommand
      :  +- CreateTableCommand CatalogTable(
      	Table:`extTable_with_partitions`
      	Created:Thu Jun 02 21:30:54 PDT 2016
      	Last Access:Wed Dec 31 15:59:59 PST 1969
      	Type:EXTERNAL
      	Schema:[`key` int, `value` string, `ds` string, `hr` string]
      	Partition Columns:[`ds`, `hr`]
      	Storage(Location:/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-a06083b8-8e88-4d07-9ff0-d6bd8d943ad3, InputFormat:org.apache.hadoop.mapred.TextInputFormat, OutputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat)), false
      ```
      
      This is also applicable to `DESC EXTENDED`. However, this does not have special handling for Data Source Tables. If needed, we need to move the logic of `DDLUtil`. Let me know if we should do it in this PR. Thanks! rxin liancheng
      
      #### How was this patch tested?
      Manual testing
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #13070 from gatorsmile/betterExplainCatalogTable.
      eb10b481
    • Josh Rosen's avatar
      [SPARK-15742][SQL] Reduce temp collections allocations in TreeNode transform methods · e5269139
      Josh Rosen authored
      In Catalyst's TreeNode transform methods we end up calling `productIterator.map(...).toArray` in a number of places, which is slightly inefficient because it needs to allocate an `ArrayBuilder` and grow a temporary array. Since we already know the size of the final output (`productArity`), we can simply allocate an array up-front and use a while loop to consume the iterator and populate the array.
      
      For most workloads, this performance difference is negligible but it does make a measurable difference in optimizer performance for queries that operate over very wide schemas (such as the benchmark queries in #13456).
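      The change can be sketched generically (Python stand-in; the real code is Scala over `productIterator`): when the result size is known up front, fill a preallocated array with a simple loop instead of growing a temporary builder:

```python
def map_known_size(items, size, f):
    # `size` is known in advance (productArity in Catalyst), so allocate the
    # output once and fill it -- no growable temporary buffer needed.
    out = [None] * size
    for i, item in enumerate(items):
        out[i] = f(item)
    return out
```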
      
      ### Perf results (from #13456 benchmarks)
      
      **Before**
      
      ```
      Java HotSpot(TM) 64-Bit Server VM 1.8.0_66-b17 on Mac OS X 10.10.5
      Intel(R) Core(TM) i7-4960HQ CPU  2.60GHz
      
      parsing large select:                    Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
      ------------------------------------------------------------------------------------------------
      1 select expressions                            19 /   22          0.0    19119858.0       1.0X
      10 select expressions                           23 /   25          0.0    23208774.0       0.8X
      100 select expressions                          55 /   73          0.0    54768402.0       0.3X
      1000 select expressions                        229 /  259          0.0   228606373.0       0.1X
      2500 select expressions                        530 /  554          0.0   529938178.0       0.0X
      ```
      
      **After**
      
      ```
      parsing large select:                    Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
      ------------------------------------------------------------------------------------------------
      1 select expressions                            15 /   21          0.0    14978203.0       1.0X
      10 select expressions                           22 /   27          0.0    22492262.0       0.7X
      100 select expressions                          48 /   64          0.0    48449834.0       0.3X
      1000 select expressions                        189 /  208          0.0   189346428.0       0.1X
      2500 select expressions                        429 /  449          0.0   428943897.0       0.0X
      ```
      
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #13484 from JoshRosen/treenode-productiterator-map.
      e5269139
    • Devaraj K's avatar
      [SPARK-15665][CORE] spark-submit --kill and --status are not working · efd3b11a
      Devaraj K authored
      ## What changes were proposed in this pull request?
      --kill and --status were not considered in OptionParser's handling, and because of that they were failing. They are now handled as part of OptionParser.handle.
      
      ## How was this patch tested?
      Added a test org.apache.spark.launcher.SparkSubmitCommandBuilderSuite.testCliKillAndStatus() and also I have verified these manually by running --kill and --status commands.
      
      Author: Devaraj K <devaraj@apache.org>
      
      Closes #13407 from devaraj-kavali/SPARK-15665.
      efd3b11a
    • Ioana Delaney's avatar
      [SPARK-15677][SQL] Query with scalar sub-query in the SELECT list throws... · 9e2eb13c
      Ioana Delaney authored
      [SPARK-15677][SQL] Query with scalar sub-query in the SELECT list throws UnsupportedOperationException
      
      ## What changes were proposed in this pull request?
      Queries with a scalar sub-query in the SELECT list, run against a local, in-memory relation, throw an
      UnsupportedOperationException.
      
      Problem repro:
      ```SQL
      scala> Seq((1, 1), (2, 2)).toDF("c1", "c2").createOrReplaceTempView("t1")
      scala> Seq((1, 1), (2, 2)).toDF("c1", "c2").createOrReplaceTempView("t2")
      scala> sql("select (select min(c1) from t2) from t1").show()
      
      java.lang.UnsupportedOperationException: Cannot evaluate expression: scalar-subquery#62 []
        at org.apache.spark.sql.catalyst.expressions.Unevaluable$class.eval(Expression.scala:215)
        at org.apache.spark.sql.catalyst.expressions.ScalarSubquery.eval(subquery.scala:62)
        at org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:142)
        at org.apache.spark.sql.catalyst.expressions.InterpretedProjection.apply(Projection.scala:45)
        at org.apache.spark.sql.catalyst.expressions.InterpretedProjection.apply(Projection.scala:29)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.immutable.List.foreach(List.scala:381)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
        at scala.collection.immutable.List.map(List.scala:285)
        at org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation$$anonfun$apply$37.applyOrElse(Optimizer.scala:1473)
      ```
      The problem is specific to local, in-memory relations. It is caused by the rule ConvertToLocalRelation, which attempts to push down
      a scalar-subquery expression to the local tables.
      
      The solution prevents the rule from applying if the Project references scalar subqueries.
      
      ## How was this patch tested?
      Added regression tests to SubquerySuite.scala
      
      Author: Ioana Delaney <ioanamdelaney@gmail.com>
      
      Closes #13418 from ioana-delaney/scalarSubV2.
      9e2eb13c
    • bomeng's avatar
      [SPARK-15737][CORE] fix jetty warning · 8fa00dd0
      bomeng authored
      ## What changes were proposed in this pull request?
      
      After upgrading Jetty to 9.2, we always see "WARN org.eclipse.jetty.server.handler.AbstractHandler: No Server set for org.eclipse.jetty.server.handler.ErrorHandler" while running any test cases.
      
      This PR will fix it.
      
      ## How was this patch tested?
      
      The existing test cases will cover it.
      
      Author: bomeng <bmeng@us.ibm.com>
      
      Closes #13475 from bomeng/SPARK-15737.
      8fa00dd0
    • Imran Rashid's avatar
      [SPARK-15714][CORE] Fix flaky o.a.s.scheduler.BlacklistIntegrationSuite · c2f0cb4f
      Imran Rashid authored
      ## What changes were proposed in this pull request?
      
      BlacklistIntegrationSuite (introduced by SPARK-10372) is a bit flaky because of some race conditions:
      1. Failed jobs might have non-empty results, because the resultHandler will be invoked for successful tasks (if there are task successes before failures)
      2. taskScheduler.taskIdToTaskSetManager must be protected by a lock on taskScheduler
      
      (1) has failed a handful of jenkins builds recently.  I don't think I've seen (2) in jenkins, but I've run into with some uncommitted tests I'm working on where there are lots more tasks.
      
      While I was in there, I also made an unrelated fix to `runningTasks` in the test framework -- there was a pointless `O(n)` operation to remove completed tasks, which could be `O(1)`.
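      The `runningTasks` fix is the usual container swap, sketched here with assumed names: a set gives O(1) removal of a completed task, where scanning a list is O(n):

```python
# List-based tracking: removing a completed task scans the list, O(n).
running_list = [1, 2, 3]
running_list.remove(2)

# Set-based tracking: removal is O(1) per completed task.
running_set = {1, 2, 3}
running_set.discard(2)
```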
      
      ## How was this patch tested?
      
      I modified the o.a.s.scheduler.BlacklistIntegrationSuite to have it run the tests 1k times on my laptop.  It failed 11 times before this change, and none with it.  (Pretty sure all the failures were problem (1), though I didn't check all of them).
      
      Also the full suite of tests via jenkins.
      
      Author: Imran Rashid <irashid@cloudera.com>
      
      Closes #13454 from squito/SPARK-15714.
      c2f0cb4f
    • Wenchen Fan's avatar
      [SPARK-15494][SQL] encoder code cleanup · 190ff274
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      Our encoder framework has evolved a lot; this PR tries to clean up the code to make it more readable and emphasise the concept that an encoder should be used as a container of serde expressions.
      
      1. move validation logic to analyzer instead of encoder
      2. only have a `resolveAndBind` method in encoder instead of `resolve` and `bind`, as we don't have the encoder life cycle concept anymore.
      3. `Dataset` don't need to keep a resolved encoder, as there is no such concept anymore. bound encoder is still needed to do serialization outside of query framework.
      4. Using `BoundReference` to represent an unresolved field in deserializer expression is kind of weird, this PR adds a `GetColumnByOrdinal` for this purpose. (serializer expression still use `BoundReference`, we can replace it with `GetColumnByOrdinal` in follow-ups)
      
      ## How was this patch tested?
      
      existing test
      
      Author: Wenchen Fan <wenchen@databricks.com>
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #13269 from cloud-fan/clean-encoder.
      190ff274
    • Dongjoon Hyun's avatar
      [SPARK-15744][SQL] Rename two TungstenAggregation*Suites and update codgen/error messages/comments · b9fcfb3b
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      For consistency, this PR updates some remaining `TungstenAggregation/SortBasedAggregate` after SPARK-15728.
      - Update a comment in codegen in `VectorizedHashMapGenerator.scala`.
      - `TungstenAggregationQuerySuite` --> `HashAggregationQuerySuite`
      - `TungstenAggregationQueryWithControlledFallbackSuite` --> `HashAggregationQueryWithControlledFallbackSuite`
      - Update two error messages in `SQLQuerySuite.scala` and `AggregationQuerySuite.scala`.
      - Update several comments.
      
      ## How was this patch tested?
      
      Manual (Only comment changes and test suite renamings).
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #13487 from dongjoon-hyun/SPARK-15744.
      b9fcfb3b
    • Sameer Agarwal's avatar
      [SPARK-15745][SQL] Use classloader's getResource() for reading resource files in HiveTests · f7288e16
      Sameer Agarwal authored
      ## What changes were proposed in this pull request?
      
      This is a cleaner approach in general, but my motivation behind this change in particular is to be able to run these tests from anywhere without relying on system properties.
      
      ## How was this patch tested?
      
      Test only change
      
      Author: Sameer Agarwal <sameer@databricks.com>
      
      Closes #13489 from sameeragarwal/resourcepath.
      f7288e16
    • Xin Wu's avatar
      [SPARK-14959][SQL] handle partitioned table directories in distributed filesystem · 76aa45d3
      Xin Wu authored
      ## What changes were proposed in this pull request?
      ##### The root cause:
      When `DataSource.resolveRelation` tries to build a `ListingFileCatalog` object, `ListLeafFiles` is invoked to retrieve a list of `FileStatus` objects from the provided path. These `FileStatus` objects include the directories for the partitions (id=0 and id=2 in the JIRA). However, `getFileBlockLocations` is then invoked on these directory `FileStatus` objects as well, which is not allowed for directories on `DistributedFileSystem`, hence the exception.
      
      This PR removes the block of code that invokes `getFileBlockLocations` for every `FileStatus` object of the provided path. Instead, we call `HadoopFsRelation.listLeafFiles` directly, because this utility method filters out directories before calling `getFileBlockLocations` to generate `LocatedFileStatus` objects.
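
      The shape of the fix — filter out directories before asking for block locations — can be sketched with hypothetical stand-ins (these are not the real Hadoop/Spark classes):

      ```scala
      object ListLeafFilesDemo {
        // Hypothetical stand-in for Hadoop's FileStatus, for illustration only.
        case class FakeStatus(path: String, isDirectory: Boolean)

        // Pretend block-location lookup that, like DistributedFileSystem's
        // getFileBlockLocations, rejects directories.
        def blockLocations(s: FakeStatus): String = {
          require(!s.isDirectory, "directories have no block locations")
          s"locations(${s.path})"
        }

        // The fix in miniature: drop directory entries *before* the
        // block-location call, so partition directories never trigger it.
        def listLeafFiles(statuses: Seq[FakeStatus]): Seq[String] =
          statuses.filterNot(_.isDirectory).map(blockLocations)

        def main(args: Array[String]): Unit = {
          val listing = Seq(
            FakeStatus("table/id=0", isDirectory = true),              // partition dir, skipped
            FakeStatus("table/id=0/part-0.parquet", isDirectory = false),
            FakeStatus("table/id=2", isDirectory = true),              // partition dir, skipped
            FakeStatus("table/id=2/part-0.parquet", isDirectory = false))
          listLeafFiles(listing).foreach(println)
        }
      }
      ```

      With the filtering in place, only the leaf data files reach the block-location call, which is why the exception no longer occurs.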
      
      ## How was this patch tested?
      Regression tests were run. Manual test:
      ```
      scala> spark.read.format("parquet").load("hdfs://bdavm009.svl.ibm.com:8020/user/spark/SPARK-14959_part").show
      +-----+---+
      | text| id|
      +-----+---+
      |hello|  0|
      |world|  0|
      |hello|  1|
      |there|  1|
      +-----+---+
      
       scala> spark.read.format("orc").load("hdfs://bdavm009.svl.ibm.com:8020/user/spark/SPARK-14959_orc").show
      +-----+---+
      | text| id|
      +-----+---+
      |hello|  0|
      |world|  0|
      |hello|  1|
      |there|  1|
      +-----+---+
      ```
      I also tried it with 2 level of partitioning.
      I have not found a way to add a test case in the unit test bucket that can test a real HDFS file location. Any suggestions will be appreciated.
      
      Author: Xin Wu <xinwu@us.ibm.com>
      
      Closes #13463 from xwu0226/SPARK-14959.
      76aa45d3
    • Sean Zhong's avatar
      [SPARK-15733][SQL] Makes the explain output less verbose by hiding some... · 6dde2740
      Sean Zhong authored
      [SPARK-15733][SQL] Makes the explain output less verbose by hiding some verbose output like None, null, empty List, and etc.
      
      ## What changes were proposed in this pull request?
      
      This PR makes the explain output less verbose by hiding some verbose output like `None`, `null`, the empty list `[]`, the empty set `{}`, etc.
      
      **Before change**:
      
      ```
      == Physical Plan ==
      ExecutedCommand
      :  +- ShowTablesCommand None, None
      ```
      
      **After change**:
      
      ```
      == Physical Plan ==
      ExecutedCommand
      :  +- ShowTablesCommand
      ```
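
      The idea can be sketched as follows (an illustrative approximation, not the actual Spark implementation): when rendering a node's argument string, drop values that carry no information before joining:

      ```scala
      object ArgStringDemo {
        // Map an argument to its rendered form, or None if it is pure noise:
        // null, None, or an empty collection.
        def flatArg(arg: Any): Option[String] = arg match {
          case null                       => None
          case None                       => None
          case seq: Seq[_] if seq.isEmpty => None
          case set: Set[_] if set.isEmpty => None
          case Some(v)                    => flatArg(v)   // unwrap and render the payload
          case other                      => Some(other.toString)
        }

        def argString(nodeName: String, args: Seq[Any]): String =
          (nodeName +: args.flatMap(flatArg)).mkString(" ")

        def main(args: Array[String]): Unit = {
          println(argString("ShowTablesCommand", Seq(None, None)))      // prints ShowTablesCommand
          println(argString("ShowTablesCommand", Seq(Some("db1"), Nil)))
        }
      }
      ```

      Uninformative arguments vanish from the rendered plan, while meaningful ones (like `Some("db1")`) still appear.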
      
      ## How was this patch tested?
      
      Manual test.
      
      Author: Sean Zhong <seanzhong@databricks.com>
      
      Closes #13470 from clockfly/verbose_breakdown_4.
      6dde2740
  5. Jun 02, 2016
    • Eric Liang's avatar
      [SPARK-15724] Add benchmarks for performance over wide schemas · 901b2e69
      Eric Liang authored
      ## What changes were proposed in this pull request?
      
      This adds microbenchmarks for tracking performance of queries over very wide or deeply nested DataFrames. It seems performance degrades when DataFrames get thousands of columns wide or hundreds of fields deep.
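
      A minimal sketch of such a microbenchmark (plain Scala over a `Map`-based "row" — not Spark's actual benchmark harness, which warms up the JIT and averages multiple runs):

      ```scala
      object WideSchemaBench {
        // Naive timing helper; a real benchmark would warm up and take the
        // best of several iterations.
        def time[A](body: => A): (A, Long) = {
          val t0 = System.nanoTime()
          val result = body
          (result, System.nanoTime() - t0)
        }

        def main(args: Array[String]): Unit = {
          for (width <- Seq(100, 1000, 10000)) {
            // Build a "row" with `width` columns named c0..c(width-1).
            val row: Map[String, Int] = (0 until width).map(i => s"c$i" -> i).toMap
            val (sum, ns) = time { row.values.sum }
            println(s"width=$width sum=$sum took ${ns / 1000} us")
          }
        }
      }
      ```

      Running the loop over increasing widths makes any super-linear cost growth visible, which is the same signal the DataFrame benchmarks in this PR are after.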
      
      ## How was this patch tested?
      
      Current results included.
      
      cc rxin JoshRosen
      
      Author: Eric Liang <ekl@databricks.com>
      
      Closes #13456 from ericl/sc-3468.
      901b2e69