  1. Jun 07, 2016
    • Herman van Hovell's avatar
      [SPARK-15789][SQL] Allow reserved keywords in most places · 91fbc880
      Herman van Hovell authored
      ## What changes were proposed in this pull request?
      The parser currently does not allow the use of some SQL keywords as table or field names. This PR adds support for all keywords as identifiers. The exceptions to this are table aliases: in that case most keywords are allowed, except for join keywords (```anti, full, inner, left, semi, right, natural, on, join, cross```) and set-operator keywords (```union, intersect, except```).
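      A hedged spark-shell sketch of the new behavior (table and column names are invented for this illustration; whether the statements execute depends on the catalog, the point is that the parser now accepts the keywords):
      
      ```scala
      scala> spark.sql("CREATE TABLE t (count INT, table STRING) USING parquet")  // keywords as field names
      scala> spark.sql("SELECT count, table FROM t")                              // keywords as identifiers
      // Still disallowed: join/set-operator keywords as table aliases, e.g.
      // "SELECT * FROM t join" would not parse `join` as an alias.
      ```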
      
      ## How was this patch tested?
      I have added/moved/renamed tests in the catalyst `*ParserSuite`s.
      
      Author: Herman van Hovell <hvanhovell@databricks.com>
      
      Closes #13534 from hvanhovell/SPARK-15789.
      91fbc880
    • Shixiong Zhu's avatar
      [SPARK-15580][SQL] Add ContinuousQueryInfo to make ContinuousQueryListener events serializable · 0cfd6192
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      This PR adds ContinuousQueryInfo to make ContinuousQueryListener events serializable in order to support writing events into the event log.
      
      ## How was this patch tested?
      
      Jenkins unit tests.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #13335 from zsxwing/query-info.
      0cfd6192
    • zhonghaihua's avatar
      [SPARK-14485][CORE] ignore task finished for executor lost and removed by driver · 695dbc81
      zhonghaihua authored
      Currently, when an executor is removed by the driver due to a heartbeat timeout, the driver re-queues the tasks that were running on that executor and sends a kill command to the cluster to kill the executor.
      However, a running task on that executor may finish and return its result to the driver before the executor is killed by that command. In that situation, the driver accepts the task-finished event and ignores the speculative and re-queued copies of the task.
      But since the executor has already been removed by the driver, the result of the finished task cannot be saved on the driver, because its BlockManagerId has also been removed from the BlockManagerMaster. So the result data of the stage is incomplete, which then causes a fetch failure. For more details, see [SPARK-14485](https://issues.apache.org/jira/browse/SPARK-14485).
      This PR introduces a mechanism to ignore this kind of task-finished event (sketched below).
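      A minimal self-contained sketch of the idea, with hypothetical names (this is not the actual Spark scheduler code):
      
      ```scala
      object IgnoreFinishedTaskSketch {
        private val removedExecutors = scala.collection.mutable.Set[String]()
      
        // Called when the driver removes an executor (e.g. heartbeat timeout).
        def executorLost(execId: String): Unit = removedExecutors += execId
      
        def handleTaskFinished(execId: String, taskId: Long): Unit = {
          if (removedExecutors.contains(execId)) {
            // The executor's BlockManagerId is gone from BlockManagerMaster, so
            // the result cannot be saved; accepting it would leave the stage
            // output incomplete and later cause a fetch failure.
            println(s"Ignoring finished task $taskId from removed executor $execId")
          } else {
            println(s"Task $taskId finished on executor $execId")
          }
        }
      }
      ```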
      
      N/A
      
      Author: zhonghaihua <793507405@qq.com>
      
      Closes #12258 from zhonghaihua/ignoreTaskFinishForExecutorLostAndRemovedByDriver.
      695dbc81
    • Yanbo Liang's avatar
      [SPARK-13590][ML][DOC] Document spark.ml LiR, LoR and AFTSurvivalRegression behavior difference · 6ecedf39
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      When fitting ```LinearRegressionModel``` (with the "l-bfgs" solver) and ```LogisticRegressionModel``` without intercept on a dataset with a constant nonzero column, spark.ml produces the same model as R glmnet but a different one from LIBSVM.
      
      When fitting ```AFTSurvivalRegressionModel``` without intercept on a dataset with a constant nonzero column, spark.ml produces a different model compared with R survival::survreg.
      
      We should output a warning message and clarify this behavior in the documentation.
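      A hedged spark-shell sketch of the condition being documented (data values are made up; this is an illustration, not the PR's code):
      
      ```scala
      import org.apache.spark.ml.regression.LinearRegression
      import org.apache.spark.ml.linalg.Vectors
      
      val df = Seq(
        (1.0, Vectors.dense(1.0, 2.0)),  // first feature is a constant nonzero column
        (2.0, Vectors.dense(1.0, 4.0)),
        (3.0, Vectors.dense(1.0, 6.0))
      ).toDF("label", "features")
      
      // Without intercept on a constant nonzero column, spark.ml ("l-bfgs")
      // matches R glmnet but differs from LIBSVM.
      val model = new LinearRegression().setSolver("l-bfgs").setFitIntercept(false).fit(df)
      ```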
      
      ## How was this patch tested?
      Document change, no unit test.
      
      cc mengxr
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #12731 from yanboliang/spark-13590.
      6ecedf39
    • Sean Zhong's avatar
      [SPARK-15674][SQL] Deprecates "CREATE TEMPORARY TABLE USING...", uses "CREAT... · 890baaca
      Sean Zhong authored
      [SPARK-15674][SQL] Deprecates "CREATE TEMPORARY TABLE USING...", uses "CREATE TEMPORARY VIEW USING..." instead
      
      ## What changes were proposed in this pull request?
      
      The current implementation of "CREATE TEMPORARY TABLE USING datasource..." does NOT create any intermediate temporary data directory, such as a temporary HDFS folder; instead, it only stores a SQL string in memory. We should probably use "TEMPORARY VIEW" instead.
      
      This PR assumes a temporary table has to be linked with some temporary intermediate data. It follows this definition of a temporary table (from the [hortonworks doc](https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.0/bk_dataintegration/content/temp-tables.html)):
      > A temporary table is a convenient way for an application to automatically manage intermediate data generated during a complex query
      
      **Example**:
      
      ```
      scala> spark.sql("CREATE temporary view  my_tab7 (c1: String, c2: String)  USING org.apache.spark.sql.execution.datasources.csv.CSVFileFormat  OPTIONS (PATH '/Users/seanzhong/csv/cars.csv')")
      scala> spark.sql("select c1, c2 from my_tab7").show()
      +----+-----+
      |  c1|   c2|
      +----+-----+
      |year| make|
      |2012|Tesla|
      ...
      ```
      
      It NOW prints a **deprecation warning** if "CREATE TEMPORARY TABLE USING..." is used.
      
      ```
      scala> spark.sql("CREATE temporary table  my_tab7 (c1: String, c2: String)  USING org.apache.spark.sql.execution.datasources.csv.CSVFileFormat  OPTIONS (PATH '/Users/seanzhong/csv/cars.csv')")
      16/05/31 10:39:27 WARN SparkStrategies$DDLStrategy: CREATE TEMPORARY TABLE tableName USING... is deprecated, please use CREATE TEMPORARY VIEW viewName USING... instead
      ```
      
      ## How was this patch tested?
      
      Unit test.
      
      Author: Sean Zhong <seanzhong@databricks.com>
      
      Closes #13414 from clockfly/create_temp_view_using.
      890baaca
    • Marcelo Vanzin's avatar
      [SPARK-15760][DOCS] Add documentation for package-related configs. · 200f01c8
      Marcelo Vanzin authored
      While there, also document spark.files and spark.jars. Text is the
      same as the spark-submit help text with some minor adjustments.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #13502 from vanzin/SPARK-15760.
      200f01c8
    • wm624@hotmail.com's avatar
      [SPARK-15684][SPARKR] Not mask startsWith and endsWith in R · 3ec4461c
      wm624@hotmail.com authored
      ## What changes were proposed in this pull request?
      
      startsWith and endsWith were added in R 3.3.0. In this PR, I make the two work in SparkR.
      1. Remove the signatures in generic.R
      2. Add setMethod in column.R
      3. Add unit tests
      
      ## How was this patch tested?
      Manually tested through the SparkR shell for both column data and string data; these tests are added to the unit test file.
      
      Author: wm624@hotmail.com <wm624@hotmail.com>
      
      Closes #13476 from wangmiao1981/start.
      3ec4461c
    • WeichenXu's avatar
      [MINOR] fix typo in documents · 1e2c9311
      WeichenXu authored
      ## What changes were proposed in this pull request?
      
      I used a spell-check tool to find typos in the Spark documents and fixed them.
      
      ## How was this patch tested?
      
      N/A
      
      Author: WeichenXu <WeichenXu123@outlook.com>
      
      Closes #13538 from WeichenXu123/fix_doc_typo.
      1e2c9311
    • Sean Zhong's avatar
      [SPARK-15792][SQL] Allows operator to change the verbosity in explain output · 5f731d68
      Sean Zhong authored
      ## What changes were proposed in this pull request?
      
      This PR allows customization of verbosity in explain output. After this change, `dataframe.explain()` and `dataframe.explain(true)` have different verbosity output for the physical plan.
      
      Currently, this PR only enables the verbosity string for the operators `HashAggregateExec` and `SortAggregateExec`. We will gradually enable the verbosity string for more operators in the future.
      
      **Less verbose mode:** dataframe.explain(extended = false)
      
      `output=[count(a)#85L]` is **NOT** displayed for HashAggregate.
      
      ```
      scala> Seq((1,2,3)).toDF("a", "b", "c").createTempView("df2")
      scala> spark.sql("select count(a) from df2").explain()
      == Physical Plan ==
      *HashAggregate(key=[], functions=[count(1)])
      +- Exchange SinglePartition
         +- *HashAggregate(key=[], functions=[partial_count(1)])
            +- LocalTableScan
      ```
      
      **Verbose mode:** dataframe.explain(extended = true)
      
      `output=[count(a)#85L]` is displayed for HashAggregate.
      
      ```
      scala> spark.sql("select count(a) from df2").explain(true)  // "output=[count(a)#85L]" is added
      ...
      == Physical Plan ==
      *HashAggregate(key=[], functions=[count(1)], output=[count(a)#85L])
      +- Exchange SinglePartition
         +- *HashAggregate(key=[], functions=[partial_count(1)], output=[count#87L])
            +- LocalTableScan
      ```
      
      ## How was this patch tested?
      
      Manual test.
      
      Author: Sean Zhong <seanzhong@databricks.com>
      
      Closes #13535 from clockfly/verbose_breakdown_2.
      5f731d68
    • Sean Zhong's avatar
      [SPARK-15632][SQL] Typed Filter should NOT change the Dataset schema · 0e0904a2
      Sean Zhong authored
      ## What changes were proposed in this pull request?
      
      This PR makes sure the typed Filter doesn't change the Dataset schema.
      
      **Before the change:**
      
      ```
      scala> val df = spark.range(0,9)
      scala> df.schema
      res12: org.apache.spark.sql.types.StructType = StructType(StructField(id,LongType,false))
      scala> val afterFilter = df.filter(_=>true)
      scala> afterFilter.schema   // !!! schema is CHANGED!!! Column name is changed from id to value, nullable is changed from false to true.
      res13: org.apache.spark.sql.types.StructType = StructType(StructField(value,LongType,true))
      
      ```
      
      SerializeFromObject and DeserializeToObject are inserted to wrap the Filter, and these two can possibly change the schema of the Dataset.
      
      **After the change:**
      
      ```
      scala> afterFilter.schema   // schema is NOT changed.
      res47: org.apache.spark.sql.types.StructType = StructType(StructField(id,LongType,false))
      ```
      
      ## How was this patch tested?
      
      Unit test.
      
      Author: Sean Zhong <seanzhong@databricks.com>
      
      Closes #13529 from clockfly/spark-15632.
      0e0904a2
  2. Jun 06, 2016
    • Subroto Sanyal's avatar
      [SPARK-15652][LAUNCHER] Added a new State (LOST) for the listeners of SparkLauncher · c409e23a
      Subroto Sanyal authored
      ## What changes were proposed in this pull request?
      This situation can happen when the LauncherConnection gets an exception while reading from the socket and terminates silently without notifying anyone, making the client/listener think that the job is still in its previous state.
      The fix forcibly sends a notification to the client that the job finished with an unknown status and lets the client handle it accordingly.
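      A hedged sketch of how a client might consume the new state (the app resource and class name are placeholders):
      
      ```scala
      import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}
      
      val handle = new SparkLauncher()
        .setAppResource("/path/to/app.jar")  // placeholder
        .setMainClass("com.example.Main")    // placeholder
        .startApplication(new SparkAppHandle.Listener {
          override def stateChanged(h: SparkAppHandle): Unit = {
            if (h.getState == SparkAppHandle.State.LOST) {
              // The connection to the launched app died (e.g. JVM crash); the
              // job finished with unknown status and the client reacts here.
              println("Application lost; final state unknown.")
            }
          }
          override def infoChanged(h: SparkAppHandle): Unit = ()
        })
      ```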
      
      ## How was this patch tested?
      Added a unit test.
      
      Author: Subroto Sanyal <ssanyal@datameer.com>
      
      Closes #13497 from subrotosanyal/SPARK-15652-handle-spark-submit-jvm-crash.
      c409e23a
    • Imran Rashid's avatar
      [SPARK-15783][CORE] still some flakiness in these blacklist tests so ignore for now · 36d3dfa5
      Imran Rashid authored
      ## What changes were proposed in this pull request?
      
      There is still some flakiness in BlacklistIntegrationSuite, so turning it off for the moment to avoid breaking more builds -- will turn it back on with more fixes.
      
      ## How was this patch tested?
      
      jenkins.
      
      Author: Imran Rashid <irashid@cloudera.com>
      
      Closes #13528 from squito/ignore_blacklist.
      36d3dfa5
    • Josh Rosen's avatar
      [SPARK-15764][SQL] Replace N^2 loop in BindReferences · 0b8d6949
      Josh Rosen authored
      BindReferences contains an O(n^2) loop which causes performance issues when operating over large schemas: to determine the ordinal of an attribute reference, we perform a linear scan over the `input` array. Because `input` can sometimes be a `List`, the call to `input(ordinal).nullable` can also be O(n).
      
      Instead of performing a linear scan, we can convert the input into an array and build a hash map from expression ids to ordinals. The greater up-front cost of the map construction is offset by the fact that an expression can contain multiple attribute references, so the cost is amortized across a number of lookups.
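      A self-contained sketch of the optimization with illustrative types (not the actual BindReferences code):
      
      ```scala
      case class Attr(exprId: Long, nullable: Boolean)
      
      def bindAll(refs: Seq[Long], input: Seq[Attr]): Seq[Int] = {
        val inputArr = input.toArray  // O(1) indexed access instead of List traversal
        val ordinalOf: Map[Long, Int] =
          inputArr.iterator.zipWithIndex.map { case (a, i) => a.exprId -> i }.toMap
        // Each lookup is now O(1) instead of a linear scan per reference.
        refs.map(ordinalOf)
      }
      ```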
      
      Perf. benchmarks to follow. /cc ericl
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #13505 from JoshRosen/bind-references-improvement.
      0b8d6949
    • Joseph K. Bradley's avatar
      [SPARK-15721][ML] Make DefaultParamsReadable, DefaultParamsWritable public · 4c74ee8d
      Joseph K. Bradley authored
      ## What changes were proposed in this pull request?
      
      Made DefaultParamsReadable and DefaultParamsWritable public. Also added relevant docs and annotations. Added UnaryTransformerExample to demonstrate the use of UnaryTransformer with DefaultParamsReadable/Writable.
      
      ## How was this patch tested?
      
      Wrote an example making use of the now-public APIs. Compiled and ran it locally.
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #13461 from jkbradley/defaultparamswritable.
      4c74ee8d
    • Dhruve Ashar's avatar
      [SPARK-14279][BUILD] Pick the spark version from pom · fa4bc8ea
      Dhruve Ashar authored
      ## What changes were proposed in this pull request?
      Change the way Spark picks up version information, and embed the build information to better identify the running Spark version.
      
      More context can be found here : https://github.com/apache/spark/pull/12152
      
      ## How was this patch tested?
      Ran the mvn and sbt builds to verify that the version information was displayed correctly when executing `spark-submit --version`.
      
      ![image](https://cloud.githubusercontent.com/assets/7732317/15197251/f7c673a2-1795-11e6-8b2f-88f2a70cf1c1.png)
      
      Author: Dhruve Ashar <dhruveashar@gmail.com>
      
      Closes #13061 from dhruve/impr/SPARK-14279.
      fa4bc8ea
    • Zheng RuiFeng's avatar
      [SPARK-14900][ML][PYSPARK] Add accuracy and deprecate precison,recall,f1 · 00ad4f05
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
      1. Add accuracy to MulticlassMetrics.
      2. Deprecate the overall precision, recall, and f1, and recommend using accuracy instead (see the sketch below).
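      A hedged spark-shell sketch of the recommended usage (the data is made up):
      
      ```scala
      import org.apache.spark.mllib.evaluation.MulticlassMetrics
      
      // Pairs of (prediction, label):
      val predictionAndLabels = sc.parallelize(Seq((0.0, 0.0), (1.0, 1.0), (1.0, 0.0), (2.0, 2.0)))
      val metrics = new MulticlassMetrics(predictionAndLabels)
      metrics.accuracy  // recommended over the deprecated overall precision/recall/f1
      ```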
      
      ## How was this patch tested?
      manual tests in pyspark shell
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #13511 from zhengruifeng/deprecate_py_precisonrecall.
      00ad4f05
    • Yanbo Liang's avatar
      [SPARK-15771][ML][EXAMPLES] Use 'accuracy' rather than 'precision' in many ML examples · a9525282
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      Since [SPARK-15617](https://issues.apache.org/jira/browse/SPARK-15617) deprecated ```precision``` in ```MulticlassClassificationEvaluator```, many ML examples broke:
      ```python
      pyspark.sql.utils.IllegalArgumentException: u'MulticlassClassificationEvaluator_4c3bb1d73d8cc0cedae6 parameter metricName given invalid value precision.'
      ```
      We should use ```accuracy``` to replace ```precision``` in these examples.
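      The fix in the examples, sketched (metric name only; the surrounding pipeline code is omitted):
      
      ```scala
      import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
      
      val evaluator = new MulticlassClassificationEvaluator()
        .setMetricName("accuracy")  // "precision" is deprecated and now rejected
      ```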
      
      ## How was this patch tested?
      Offline tests.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #13519 from yanboliang/spark-15771.
      a9525282
    • Zheng RuiFeng's avatar
      [MINOR] Fix Typos 'an -> a' · fd8af397
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
      
      `an -> a`
      
      Use cmds like `find . -name '*.R' | xargs -i sh -c "grep -in ' an [^aeiou]' {} && echo {}"` to generate candidates, and review them one by one.
      
      ## How was this patch tested?
      manual tests
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #13515 from zhengruifeng/an_a.
      fd8af397
    • Reynold Xin's avatar
      32f2f95d
    • Takeshi YAMAMURO's avatar
      [SPARK-15585][SQL] Fix NULL handling along with a spark-csv behaviour · b7e8d1cb
      Takeshi YAMAMURO authored
      ## What changes were proposed in this pull request?
      This PR fixes the behaviour of `format("csv").option("quote", null)` to match that of spark-csv.
      Also, it explicitly sets default values for the CSV options in Python.
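      A hedged sketch of the option under test (the path is a placeholder; per the PR, a null quote should now behave as it did in spark-csv):
      
      ```scala
      val df = spark.read
        .format("csv")
        .option("quote", null: String)  // null quote, the case this PR fixes
        .load("/path/to/cars.csv")      // placeholder path
      ```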
      
      ## How was this patch tested?
      Added tests in CSVSuite.
      
      Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>
      
      Closes #13372 from maropu/SPARK-15585.
      b7e8d1cb
  3. Jun 05, 2016
    • Hiroshi Inoue's avatar
      [SPARK-15704][SQL] add a test case in DatasetAggregatorSuite for regression testing · 79268aa4
      Hiroshi Inoue authored
      ## What changes were proposed in this pull request?
      
      This change fixes a crash in TungstenAggregate, caused by an IndexOutOfBoundsException, while executing the "Dataset complex Aggregator" test case.
      
      jira entry for detail: https://issues.apache.org/jira/browse/SPARK-15704
      
      ## How was this patch tested?
      Using existing unit tests (including DatasetBenchmark)
      
      Author: Hiroshi Inoue <inouehrs@jp.ibm.com>
      
      Closes #13446 from inouehrs/fix_aggregate.
      79268aa4
    • Josh Rosen's avatar
      [SPARK-15748][SQL] Replace inefficient foldLeft() call with flatMap() in PartitionStatistics · 26c1089c
      Josh Rosen authored
      `PartitionStatistics` uses `foldLeft` and list concatenation (`++`) to flatten an iterator of lists, but this is extremely inefficient compared to simply doing `flatMap`/`flatten`, because it performs many unnecessary object allocations. Simply replacing the `foldLeft` with a `flatMap` yields decent performance gains when constructing PartitionStatistics instances for tables with many columns.
      
      This patch fixes this and also makes two similar changes in MLlib and streaming to try to fix all known occurrences of this pattern.
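      The pattern change in miniature (a sketch, not the PartitionStatistics code):
      
      ```scala
      val lists: Seq[List[Int]] = Seq(List(1, 2), List(3), List(4, 5))
      
      // Before: builds a new list on every concatenation step.
      val slow = lists.foldLeft(List.empty[Int])(_ ++ _)
      
      // After: a single pass with no intermediate allocations.
      val fast = lists.flatten
      ```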
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #13491 from JoshRosen/foldleft-to-flatmap.
      26c1089c
    • Wenchen Fan's avatar
      [SPARK-15657][SQL] RowEncoder should validate the data type of input object · 30c4774f
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      This PR improves the error handling of `RowEncoder`. When we create a `RowEncoder` with a given schema, we should validate the data type of the input object, e.g. we should throw an exception when a field is a boolean but is declared as a string column.
      
      This PR also removes the support for using `Product` as a valid external type of struct type. That support was added in https://github.com/apache/spark/pull/9712, but is incomplete; e.g. nested products and products inside arrays both do not work. However, we never officially supported this feature, and I think it's OK to ban it.
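      A hedged spark-shell sketch of the kind of mismatch that should now be rejected with a clear error:
      
      ```scala
      import org.apache.spark.sql.Row
      import org.apache.spark.sql.types._
      
      val schema = new StructType().add("name", StringType)
      val rows = spark.sparkContext.parallelize(Seq(Row(true)))  // boolean, declared as string
      val df = spark.createDataFrame(rows, schema)
      df.collect()  // expected to fail with a validation error for the `name` column
      ```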
      
      ## How was this patch tested?
      
      new tests in `RowEncoderSuite`.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #13401 from cloud-fan/bug.
      30c4774f
    • Kai Jiang's avatar
      [MINOR][R][DOC] Fix R documentation generation instruction. · 8a911051
      Kai Jiang authored
      ## What changes were proposed in this pull request?
      Changes in R/README.md:
      
      - Make the steps for generating the SparkR documentation clearer.
      - Link R/DOCUMENTATION.md from R/README.md.
      - Turn on some code syntax highlighting in R/README.md.
      
      ## How was this patch tested?
      local test
      
      Author: Kai Jiang <jiangkai@gmail.com>
      
      Closes #13488 from vectorijk/R-Readme.
      8a911051
    • Zheng RuiFeng's avatar
      [SPARK-15770][ML] Annotation audit for Experimental and DeveloperApi · 372fa61f
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
      1. Remove the `:: Experimental ::` comment from non-experimental APIs.
      2. Add the `:: Experimental ::` comment to experimental APIs.
      3. Add the `:: DeveloperApi ::` comment to DeveloperApi APIs (see the sketch below).
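      A minimal sketch of the convention the audit enforces (the class names are invented): the scaladoc tag should agree with the annotation.
      
      ```scala
      import org.apache.spark.annotation.{DeveloperApi, Experimental}
      
      /** :: DeveloperApi ::
       * Tag and annotation agree.
       */
      @DeveloperApi
      class ExampleDeveloperApi
      
      /** :: Experimental ::
       * Tag and annotation agree.
       */
      @Experimental
      class ExampleExperimentalApi
      ```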
      
      ## How was this patch tested?
      manual tests
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #13514 from zhengruifeng/del_experimental.
      372fa61f
    • Brett Randall's avatar
      [SPARK-15723] Fixed local-timezone-brittle test where short-timezone form "EST" is … · 4e767d0f
      Brett Randall authored
      ## What changes were proposed in this pull request?
      
      Stop using the abbreviated and ambiguous timezone "EST" in a test, since it depends on the machine's local default timezone and fails in different timezones. Fixed [SPARK-15723](https://issues.apache.org/jira/browse/SPARK-15723).
      
      ## How was this patch tested?
      
      Note that to reproduce this problem in any locale/timezone, you can modify the scalatest-maven-plugin argLine to add a timezone:
      
          <argLine>-ea -Xmx3g -XX:MaxPermSize=${MaxPermGen} -XX:ReservedCodeCacheSize=${CodeCacheSize} -Duser.timezone="Australia/Sydney"</argLine>
      
      and run
      
          $ mvn test -DwildcardSuites=org.apache.spark.status.api.v1.SimpleDateParamSuite -Dtest=none
      
      Equally, this will fix it in an affected timezone:
      
          <argLine>-ea -Xmx3g -XX:MaxPermSize=${MaxPermGen} -XX:ReservedCodeCacheSize=${CodeCacheSize} -Duser.timezone="America/New_York"</argLine>
      
      To test the fix, apply the above change to `pom.xml` to set test TZ to `Australia/Sydney`, and confirm the test now passes.
      
      Author: Brett Randall <javabrett@gmail.com>
      
      Closes #13462 from javabrett/SPARK-15723-SimpleDateParamSuite.
      4e767d0f
  4. Jun 04, 2016
    • Weiqing Yang's avatar
      [SPARK-15707][SQL] Make Code Neat - Use map instead of if check. · 0f307db5
      Weiqing Yang authored
      ## What changes were proposed in this pull request?
      In the `forType` function of object `RandomDataGenerator`, the following code:
      
      ```
      if (maybeSqlTypeGenerator.isDefined) {
        ...
        Some(generator)
      } else {
        None
      }
      ```
      
      will be changed to use `maybeSqlTypeGenerator.map` instead (see the sketch below).
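      A self-contained sketch of the refactor with illustrative names:
      
      ```scala
      // Option.map folds the isDefined/Some/None pattern into one step:
      def build(maybeGen: Option[String]): Option[String] =
        maybeGen.map(g => s"generator for $g")  // None stays None automatically
      ```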
      
      ## How was this patch tested?
      All of the current unit tests passed.
      
      Author: Weiqing Yang <yangweiqing001@gmail.com>
      
      Closes #13448 from Sherry302/master.
      0f307db5
    • Josh Rosen's avatar
      [SPARK-15762][SQL] Cache Metadata & StructType hashCodes; use singleton Metadata.empty · 091f81e1
      Josh Rosen authored
      We should cache `Metadata.hashCode` and use a singleton for `Metadata.empty` because calculating metadata hashCodes appears to be a bottleneck for certain workloads.
      
      We should also cache `StructType.hashCode`.
      
      In an optimizer stress-test benchmark run by ericl, these `hashCode` calls accounted for roughly 40% of the total CPU time and this bottleneck was completely eliminated by the caching added by this patch.
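      The caching idea in miniature, assuming immutable contents (a sketch, not Spark's actual Metadata class):
      
      ```scala
      final class CachedHash(val fields: Map[String, Any]) {
        // Immutable contents, so the hash can be computed once and reused.
        private lazy val cachedHashCode: Int = fields.hashCode()
        override def hashCode(): Int = cachedHashCode
      }
      
      object CachedHash {
        // Shared singleton avoids re-hashing the common empty value.
        val empty = new CachedHash(Map.empty)
      }
      ```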
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #13504 from JoshRosen/metadata-fix.
      091f81e1
    • Sean Owen's avatar
      [MINOR][BUILD] Add modernizr MIT license; specify "2014 and onwards" in license copyright · 681387b2
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Per conversation on dev list, add missing modernizr license.
      Specify "2014 and onwards" in copyright statement.
      
      ## How was this patch tested?
      
      (none required)
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #13510 from srowen/ModernizrLicense.
      681387b2
    • Ruifeng Zheng's avatar
      [SPARK-15617][ML][DOC] Clarify that fMeasure in MulticlassMetrics is "micro" f1_score · 2099e05f
      Ruifeng Zheng authored
      ## What changes were proposed in this pull request?
      1. Delete precision and recall in `ml.MulticlassClassificationEvaluator`.
      2. Update the user guide for `mllib.weightedFMeasure`.
      
      ## How was this patch tested?
      local build
      
      Author: Ruifeng Zheng <ruifengz@foxmail.com>
      
      Closes #13390 from zhengruifeng/clarify_f1.
      2099e05f
    • Lianhui Wang's avatar
      [SPARK-15756][SQL] Support command 'create table stored as orcfile/parquetfile/avrofile' · 2ca563cc
      Lianhui Wang authored
      ## What changes were proposed in this pull request?
      Spark SQL currently supports 'create table src stored as orc/parquet/avro' for ORC/Parquet/Avro tables, while Hive supports both 'stored as orc/parquet/avro' and 'stored as orcfile/parquetfile/avrofile'.
      This PR adds support for the keywords 'orcfile/parquetfile/avrofile' in Spark SQL (see the sketch below).
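      A hedged sketch of the two spellings (requires Hive support; table names are placeholders):
      
      ```scala
      spark.sql("CREATE TABLE src1 (key INT) STORED AS orc")      // already supported
      spark.sql("CREATE TABLE src2 (key INT) STORED AS orcfile")  // added by this PR
      ```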
      
      ## How was this patch tested?
      add unit tests
      
      Author: Lianhui Wang <lianhuiwang09@gmail.com>
      
      Closes #13500 from lianhuiwang/SPARK-15756.
      2ca563cc
  5. Jun 03, 2016
    • Subroto Sanyal's avatar
      [SPARK-15754][YARN] Not letting the credentials containing hdfs delegation... · 61d729ab
      Subroto Sanyal authored
      [SPARK-15754][YARN] Not letting the credentials containing hdfs delegation tokens to be added in current user credential.
      
      ## What changes were proposed in this pull request?
      With this change, the credentials are not added to the credentials of UserGroupInformation.getCurrentUser(). Further, if the client has the possibility to log in using a keytab, then the updateDelegationToken thread is not started on the client.
      
      ## How was this patch tested?
      ran dev/run-tests
      
      Author: Subroto Sanyal <ssanyal@datameer.com>
      
      Closes #13499 from subrotosanyal/SPARK-15754-save-ugi-from-changing.
      61d729ab
    • Davies Liu's avatar
      [SPARK-15391] [SQL] manage the temporary memory of timsort · 3074f575
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      Currently, the memory for the temporary buffer used by TimSort is always allocated on-heap without bookkeeping, which can cause OOM in both on-heap and off-heap modes.
      
      This PR tries to manage that memory by preallocating it together with the pointer array, the same as RadixSort does. This works for both on-heap and off-heap modes.
      
      This PR also changes the loadFactor of BytesToBytesMap to 0.5 (it was 0.70), which enables us to use radix sort and also makes sure that we have enough memory for TimSort.
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #13318 from davies/fix_timsort.
      3074f575
    • Holden Karau's avatar
      [SPARK-15168][PYSPARK][ML] Add missing params to MultilayerPerceptronClassifier · 67cc89ff
      Holden Karau authored
      ## What changes were proposed in this pull request?
      
      MultilayerPerceptronClassifier is missing the step size, solver, and weights params. Add these params. Also clarify the scaladoc a bit while updating them.
      
      Eventually we should follow up and unify the HasSolver params (filed https://issues.apache.org/jira/browse/SPARK-15169).
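      A Scala-side sketch of the params this PR exposes in PySpark (layer sizes and values are arbitrary examples):
      
      ```scala
      import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
      
      val mlp = new MultilayerPerceptronClassifier()
        .setLayers(Array(4, 5, 3))  // input, hidden, output layer sizes
        .setSolver("l-bfgs")        // solver param
        .setStepSize(0.03)          // step size param
      ```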
      
      ## How was this patch tested?
      
      Doc tests
      
      Author: Holden Karau <holden@us.ibm.com>
      
      Closes #12943 from holdenk/SPARK-15168-add-missing-params-to-MultilayerPerceptronClassifier.
      67cc89ff
    • Andrew Or's avatar
      [SPARK-15722][SQL] Disallow specifying schema in CTAS statement · b1cc7da3
      Andrew Or authored
      ## What changes were proposed in this pull request?
      
      As of this patch, the following throws an exception because the schemas may not match:
      ```
      CREATE TABLE students (age INT, name STRING) AS SELECT * FROM boxes
      ```
      but this is OK:
      ```
      CREATE TABLE students AS SELECT * FROM boxes
      ```
      
      ## How was this patch tested?
      
      SQLQuerySuite, HiveDDLCommandSuite
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #13490 from andrewor14/ctas-no-column.
      b1cc7da3
    • Wenchen Fan's avatar
      [SPARK-15140][SQL] make the semantics of null input object for encoder clear · 11c83f83
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      For an input object of non-flat type, we can't encode it to a row if it's null, as Spark SQL doesn't allow a row to be null; only its columns can be null.
      
      This PR explicitly adds this constraint and throws an exception if users break it.
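      A hedged spark-shell sketch of the new constraint:
      
      ```scala
      case class Person(name: String, age: Int)
      // Encoding the null Person is expected to throw: a row itself cannot be
      // null, only its columns can.
      val ds = Seq(Person("a", 1), null.asInstanceOf[Person]).toDS()
      ```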
      
      ## How was this patch tested?
      
      several new tests
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #13469 from cloud-fan/null-object.
      11c83f83
    • Xin Wu's avatar
      [SPARK-15681][CORE] allow lowercase or mixed case log level string when calling sc.setLogLevel · 28ad0f7b
      Xin Wu authored
      ## What changes were proposed in this pull request?
      Currently the `SparkContext` API `setLogLevel(level: String)` cannot handle lowercase or mixed-case input strings, but `org.apache.log4j.Level.toLevel` can.
      
      This PR allows case-insensitive user input for the log level.
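      After the change, all of these spark-shell calls should be accepted:
      
      ```scala
      sc.setLogLevel("WARN")
      sc.setLogLevel("warn")
      sc.setLogLevel("Warn")
      ```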
      
      ## How was this patch tested?
      A unit test case is added.
      
      Author: Xin Wu <xinwu@us.ibm.com>
      
      Closes #13422 from xwu0226/reset_loglevel.
      28ad0f7b
    • Wenchen Fan's avatar
      [SPARK-15547][SQL] nested case class in encoder can have different number of... · 61b80d55
      Wenchen Fan authored
      [SPARK-15547][SQL] nested case class in encoder can have different number of fields from the real schema
      
      ## What changes were proposed in this pull request?
      
      There are 2 kinds of `GetStructField`:
      
      1. resolved from `UnresolvedExtractValue`; it will have a `name` property.
      2. created when we build the deserializer expression for a nested tuple; it has no `name` property.
      
      When we want to validate the ordinals of a nested tuple, we should only catch `GetStructField` without the `name` property.
      
      ## How was this patch tested?
      
      new test in `EncoderResolutionSuite`
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #13474 from cloud-fan/ordinal-check.
      61b80d55
    • gatorsmile's avatar
      [SPARK-15286][SQL] Make the output readable for EXPLAIN CREATE TABLE and DESC EXTENDED · eb10b481
      gatorsmile authored
      #### What changes were proposed in this pull request?
      Before this PR, the output of EXPLAIN for the following SQL looked like
      
      ```SQL
      CREATE EXTERNAL TABLE extTable_with_partitions (key INT, value STRING)
      PARTITIONED BY (ds STRING, hr STRING)
      LOCATION '/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-b39a6185-8981-403b-a4aa-36fb2f4ca8a9'
      ```
      ```
      ExecutedCommand CreateTableCommand CatalogTable(`extTable_with_partitions`,CatalogTableType(EXTERNAL),CatalogStorageFormat(Some(/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-dd234718-e85d-4c5a-8353-8f1834ac0323),Some(org.apache.hadoop.mapred.TextInputFormat),Some(org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat),None,false,Map()),List(CatalogColumn(key,int,true,None), CatalogColumn(value,string,true,None), CatalogColumn(ds,string,true,None), CatalogColumn(hr,string,true,None)),List(ds, hr),List(),List(),-1,,1463026413544,-1,Map(),None,None,None), false
      ```
      
      After this PR, the output is like
      
      ```
      ExecutedCommand
      :  +- CreateTableCommand CatalogTable(
      	Table:`extTable_with_partitions`
      	Created:Thu Jun 02 21:30:54 PDT 2016
      	Last Access:Wed Dec 31 15:59:59 PST 1969
      	Type:EXTERNAL
      	Schema:[`key` int, `value` string, `ds` string, `hr` string]
      	Partition Columns:[`ds`, `hr`]
      	Storage(Location:/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-a06083b8-8e88-4d07-9ff0-d6bd8d943ad3, InputFormat:org.apache.hadoop.mapred.TextInputFormat, OutputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat)), false
      ```
      
      This also applies to `DESC EXTENDED`. However, this does not have special handling for Data Source Tables. If needed, we need to move the logic of `DDLUtil`. Let me know if we should do it in this PR. Thanks! rxin liancheng
      
      #### How was this patch tested?
      Manual testing
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #13070 from gatorsmile/betterExplainCatalogTable.
      eb10b481
    • Josh Rosen's avatar
      [SPARK-15742][SQL] Reduce temp collections allocations in TreeNode transform methods · e5269139
      Josh Rosen authored
      In Catalyst's TreeNode transform methods we end up calling `productIterator.map(...).toArray` in a number of places, which is slightly inefficient because it needs to allocate an `ArrayBuilder` and grow a temporary array. Since we already know the size of the final output (`productArity`), we can simply allocate an array up-front and use a while loop to consume the iterator and populate the array.
      
      For most workloads, this performance difference is negligible but it does make a measurable difference in optimizer performance for queries that operate over very wide schemas (such as the benchmark queries in #13456).
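      A self-contained sketch of the pattern change (not the actual TreeNode code):
      
      ```scala
      def mapProduct(p: Product, f: Any => Any): Array[Any] = {
        // Before: p.productIterator.map(f).toArray grows a temporary ArrayBuilder.
        // After: the final size is already known from productArity.
        val arr = new Array[Any](p.productArity)
        var i = 0
        while (i < p.productArity) {
          arr(i) = f(p.productElement(i))
          i += 1
        }
        arr
      }
      ```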
      
      ### Perf results (from #13456 benchmarks)
      
      **Before**
      
      ```
      Java HotSpot(TM) 64-Bit Server VM 1.8.0_66-b17 on Mac OS X 10.10.5
      Intel(R) Core(TM) i7-4960HQ CPU  2.60GHz
      
      parsing large select:                    Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
      ------------------------------------------------------------------------------------------------
      1 select expressions                            19 /   22          0.0    19119858.0       1.0X
      10 select expressions                           23 /   25          0.0    23208774.0       0.8X
      100 select expressions                          55 /   73          0.0    54768402.0       0.3X
      1000 select expressions                        229 /  259          0.0   228606373.0       0.1X
      2500 select expressions                        530 /  554          0.0   529938178.0       0.0X
      ```
      
      **After**
      
      ```
      parsing large select:                    Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
      ------------------------------------------------------------------------------------------------
      1 select expressions                            15 /   21          0.0    14978203.0       1.0X
      10 select expressions                           22 /   27          0.0    22492262.0       0.7X
      100 select expressions                          48 /   64          0.0    48449834.0       0.3X
      1000 select expressions                        189 /  208          0.0   189346428.0       0.1X
      2500 select expressions                        429 /  449          0.0   428943897.0       0.0X
      ```
      
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #13484 from JoshRosen/treenode-productiterator-map.
      e5269139