  1. Aug 22, 2017
    • jerryshao's avatar
      [SPARK-20641][CORE] Add missing kvstore module in Launcher and SparkSubmit code · 3ed1ae10
      jerryshao authored
      There are two places, in Launcher and SparkSubmit, that explicitly list all the Spark submodules; the newly added kvstore module is missing from both, so this minor PR adds it.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #19014 from jerryshao/missing-kvstore.
      3ed1ae10
    • gatorsmile's avatar
      [SPARK-21803][TEST] Remove the HiveDDLCommandSuite · be72b157
      gatorsmile authored
      ## What changes were proposed in this pull request?
      We do not have any Hive-specific parser. It does not make sense to keep a parser-specific test suite `HiveDDLCommandSuite.scala` in the Hive package. This PR is to remove it.
      
      ## How was this patch tested?
      N/A
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #19015 from gatorsmile/combineDDL.
      be72b157
    • Andrew Ray's avatar
      [SPARK-21584][SQL][SPARKR] Update R method for summary to call new implementation · 5c9b3017
      Andrew Ray authored
      ## What changes were proposed in this pull request?
      
      SPARK-21100 introduced a new `summary` method to the Scala/Java Dataset API that includes expanded statistics (vs `describe`) and control over which statistics to compute. Currently in the R API `summary` acts as an alias for `describe`. This patch updates the R API to call the new `summary` method in the JVM, which includes the additional statistics and the ability to select which ones to compute.
      
      This does not break the current interface as the present `summary` method does not take additional arguments like `describe` and the output was never meant to be used programmatically.
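      
      For reference, a minimal Scala sketch of the underlying Dataset API that the R wrapper now delegates to (column names and the choice of statistics here are illustrative only):
      
      ```scala
      import org.apache.spark.sql.SparkSession
      
      val spark = SparkSession.builder().master("local[*]").appName("summary-sketch").getOrCreate()
      import spark.implicits._
      
      val df = Seq((1, 10.0), (2, 20.0), (3, 30.0)).toDF("id", "value")
      
      // Default: count, mean, stddev, min, approximate quartiles, max
      df.summary().show()
      
      // Select specific statistics to compute
      df.summary("count", "min", "25%", "75%", "max").show()
      ```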
      
      ## How was this patch tested?
      
      Modified and additional unit tests.
      
      Author: Andrew Ray <ray.andrew@gmail.com>
      
      Closes #18786 from aray/summary-r.
      5c9b3017
  2. Aug 21, 2017
    • Kyle Kelley's avatar
      [SPARK-21070][PYSPARK] Attempt to update cloudpickle again · 751f5133
      Kyle Kelley authored
      ## What changes were proposed in this pull request?
      
      Based on https://github.com/apache/spark/pull/18282 by rgbkrk, this PR attempts to update to the currently released cloudpickle and to minimize the difference between Spark's cloudpickle and the "stock" cloudpickle, with the goal of eventually using the stock cloudpickle.
      
      Some notable changes:
      * Import submodules accessed by pickled functions (cloudpipe/cloudpickle#80)
      * Support recursive functions inside closures (cloudpipe/cloudpickle#89, cloudpipe/cloudpickle#90)
      * Fix ResourceWarnings and DeprecationWarnings (cloudpipe/cloudpickle#88)
      * Assume modules with __file__ attribute are not dynamic (cloudpipe/cloudpickle#85)
      * Make cloudpickle Python 3.6 compatible (cloudpipe/cloudpickle#72)
      * Allow pickling of builtin methods (cloudpipe/cloudpickle#57)
      * Add ability to pickle dynamically created modules (cloudpipe/cloudpickle#52)
      * Support method descriptor (cloudpipe/cloudpickle#46)
      * No more pickling of closed files, was broken on Python 3 (cloudpipe/cloudpickle#32)
      * **Remove non-standard __transient__ check (cloudpipe/cloudpickle#110)** -- while we don't use this internally and have no tests or documentation for its use, downstream code may use __transient__ even though it has never been part of the API; if we merge this we should include a note about it in the release notes.
      * Support for pickling loggers (yay!) (cloudpipe/cloudpickle#96)
      * BUG: Fix crash when pickling dynamic class cycles. (cloudpipe/cloudpickle#102)
      
      ## How was this patch tested?
      
      Existing PySpark unit tests + the unit tests from the cloudpickle project on their own.
      
      Author: Holden Karau <holden@us.ibm.com>
      Author: Kyle Kelley <rgbkrk@gmail.com>
      
      Closes #18734 from holdenk/holden-rgbkrk-cloudpickle-upgrades.
      751f5133
    • Yanbo Liang's avatar
      [SPARK-19762][ML][FOLLOWUP] Add necessary comments to L2Regularization. · c108a5d3
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      MLlib `LinearRegression/LogisticRegression/LinearSVC` always standardizes the data during training to improve the rate of convergence, regardless of whether _standardization_ is true or false. If _standardization_ is false, we perform reverse standardization by penalizing each component differently, to get effectively the same objective function as if the training dataset had not been standardized. We should keep these comments in the code so that developers understand how this is handled correctly.
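      
      To make the comment concrete, here is one hedged way to write the idea down (the notation is ours, not taken from the code):
      
      ```latex
      % The optimizer works on standardized features with coefficients w_j,
      % where the original-scale coefficients are \beta_j = w_j / \sigma_j.
      
      % standardization = true: penalize the internal coefficients directly.
      \[ R(w) = \frac{\lambda}{2} \sum_j w_j^2 \]
      
      % standardization = false: penalize the j-th component with an extra
      % 1/\sigma_j^2 factor, so the user-facing objective is the plain L2
      % penalty on the original-scale coefficients.
      \[ R(w) = \frac{\lambda}{2} \sum_j \frac{w_j^2}{\sigma_j^2}
              = \frac{\lambda}{2} \sum_j \beta_j^2 \]
      ```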
      
      ## How was this patch tested?
      Existing tests, only adding some comments in code.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #18992 from yanboliang/SPARK-19762.
      c108a5d3
    • Marcelo Vanzin's avatar
      [SPARK-21617][SQL] Store correct table metadata when altering schema in Hive metastore. · 84b5b16e
      Marcelo Vanzin authored
      For Hive tables, the current "replace the schema" code is the correct
      path, except that an exception in that path should result in an error, and
      not in retrying in a different way.
      
      For data source tables, Spark may generate a non-compatible Hive table;
      but for that to work with Hive 2.1, the detection of data source tables needs
      to be fixed in the Hive client, to also consider the raw tables used by code
      such as `alterTableSchema`.
      
      Tested with existing and added unit tests (plus internal tests with a 2.1 metastore).
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #18849 from vanzin/SPARK-21617.
      84b5b16e
    • Yuming Wang's avatar
      [SPARK-21790][TESTS][FOLLOW-UP] Add filter pushdown verification back. · ba843292
      Yuming Wang authored
      ## What changes were proposed in this pull request?
      
      The previous PR (https://github.com/apache/spark/pull/19000) removed the filter pushdown verification; this PR adds it back.
      
      ## How was this patch tested?
      manual tests
      
      Author: Yuming Wang <wgyumg@gmail.com>
      
      Closes #19002 from wangyum/SPARK-21790-follow-up.
      ba843292
    • Nick Pentreath's avatar
      [SPARK-21468][PYSPARK][ML] Python API for FeatureHasher · 988b84d7
      Nick Pentreath authored
      Add Python API for `FeatureHasher` transformer.
      
      ## How was this patch tested?
      
      New doc test.
      
      Author: Nick Pentreath <nickp@za.ibm.com>
      
      Closes #18970 from MLnick/SPARK-21468-pyspark-hasher.
      988b84d7
    • Sean Owen's avatar
      [SPARK-21718][SQL] Heavy log of type: "Skipping partition based on stats ..." · b3a07526
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Reduce 'Skipping partitions' message to debug
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #19010 from srowen/SPARK-21718.
      b3a07526
    • Sergey Serebryakov's avatar
      [SPARK-21782][CORE] Repartition creates skews when numPartitions is a power of 2 · 77d046ec
      Sergey Serebryakov authored
      ## Problem
      When an RDD (particularly one with a low items-per-partition ratio) is repartitioned to a numPartitions that is a power of 2, the resulting partitions are very unevenly sized, due to using a fixed seed to initialize the PRNG and using the PRNG only once per partition. See details in https://issues.apache.org/jira/browse/SPARK-21782
      
      ## What changes were proposed in this pull request?
      Instead of directly using `0, 1, 2,...` seeds to initialize `Random`, hash them with `scala.util.hashing.byteswap32()`.
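      
      A minimal standalone sketch of the idea (not the actual `RDD.repartition` internals):
      
      ```scala
      import scala.util.Random
      import scala.util.hashing.byteswap32
      
      // Seeding each partition's PRNG with the raw index 0, 1, 2, ... produces
      // correlated first draws; hashing the index first decorrelates the seeds.
      val numPartitions = 16
      val naiveRngs  = (0 until numPartitions).map(i => new Random(i))
      val hashedRngs = (0 until numPartitions).map(i => new Random(byteswap32(i)))
      
      // Compare the first draw of each generator modulo a power-of-two target.
      println(naiveRngs.map(_.nextInt() & 15).mkString(" "))
      println(hashedRngs.map(_.nextInt() & 15).mkString(" "))
      ```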
      
      ## How was this patch tested?
      `build/mvn -Dtest=none -DwildcardSuites=org.apache.spark.rdd.RDDSuite test`
      
      Author: Sergey Serebryakov <sserebryakov@tesla.com>
      
      Closes #18990 from megaserg/repartition-skew.
      77d046ec
  3. Aug 20, 2017
    • Liang-Chi Hsieh's avatar
      [SPARK-21721][SQL][FOLLOWUP] Clear FileSystem deleteOnExit cache when paths... · 28a6cca7
      Liang-Chi Hsieh authored
      [SPARK-21721][SQL][FOLLOWUP] Clear FileSystem deleteOnExit cache when paths are successfully removed
      
      ## What changes were proposed in this pull request?
      
      Fix a typo in test.
      
      ## How was this patch tested?
      
      Jenkins tests.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #19005 from viirya/SPARK-21721-followup.
      28a6cca7
    • hyukjinkwon's avatar
      [SPARK-21773][BUILD][DOCS] Installs mkdocs if missing in the path in SQL documentation build · 41e0eb71
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes to install `mkdocs` by `pip install` if missing in the path. Mainly to fix Jenkins's documentation build failure in `spark-master-docs`. See https://amplab.cs.berkeley.edu/jenkins/job/spark-master-docs/3580/console.
      
      It also adds `mkdocs` as requirements in `docs/README.md`.
      
      ## How was this patch tested?
      
      I manually ran `jekyll build` under the `docs` directory after manually removing `mkdocs` via `pip uninstall mkdocs`.
      
      Also, I tested this in the same way on CentOS Linux release 7.3.1611 (Core), where I had built Spark a few times but had never built the documentation before and `mkdocs` was not installed.
      
      ```
      ...
      Moving back into docs dir.
      Moving to SQL directory and building docs.
      Missing mkdocs in your path, trying to install mkdocs for SQL documentation generation.
      Collecting mkdocs
        Downloading mkdocs-0.16.3-py2.py3-none-any.whl (1.2MB)
          100% |████████████████████████████████| 1.2MB 574kB/s
      Requirement already satisfied: PyYAML>=3.10 in /usr/lib64/python2.7/site-packages (from mkdocs)
      Collecting livereload>=2.5.1 (from mkdocs)
        Downloading livereload-2.5.1-py2-none-any.whl
      Collecting tornado>=4.1 (from mkdocs)
        Downloading tornado-4.5.1.tar.gz (483kB)
          100% |████████████████████████████████| 491kB 1.4MB/s
      Collecting Markdown>=2.3.1 (from mkdocs)
        Downloading Markdown-2.6.9.tar.gz (271kB)
          100% |████████████████████████████████| 276kB 2.4MB/s
      Collecting click>=3.3 (from mkdocs)
        Downloading click-6.7-py2.py3-none-any.whl (71kB)
          100% |████████████████████████████████| 71kB 2.8MB/s
      Requirement already satisfied: Jinja2>=2.7.1 in /usr/lib/python2.7/site-packages (from mkdocs)
      Requirement already satisfied: six in /usr/lib/python2.7/site-packages (from livereload>=2.5.1->mkdocs)
      Requirement already satisfied: backports.ssl_match_hostname in /usr/lib/python2.7/site-packages (from tornado>=4.1->mkdocs)
      Collecting singledispatch (from tornado>=4.1->mkdocs)
        Downloading singledispatch-3.4.0.3-py2.py3-none-any.whl
      Collecting certifi (from tornado>=4.1->mkdocs)
        Downloading certifi-2017.7.27.1-py2.py3-none-any.whl (349kB)
          100% |████████████████████████████████| 358kB 2.1MB/s
      Collecting backports_abc>=0.4 (from tornado>=4.1->mkdocs)
        Downloading backports_abc-0.5-py2.py3-none-any.whl
      Requirement already satisfied: MarkupSafe>=0.23 in /usr/lib/python2.7/site-packages (from Jinja2>=2.7.1->mkdocs)
      Building wheels for collected packages: tornado, Markdown
        Running setup.py bdist_wheel for tornado ... done
        Stored in directory: /root/.cache/pip/wheels/84/83/cd/6a04602633457269d161344755e6766d24307189b7a67ff4b7
        Running setup.py bdist_wheel for Markdown ... done
        Stored in directory: /root/.cache/pip/wheels/bf/46/10/c93e17ae86ae3b3a919c7b39dad3b5ccf09aeb066419e5c1e5
      Successfully built tornado Markdown
      Installing collected packages: singledispatch, certifi, backports-abc, tornado, livereload, Markdown, click, mkdocs
      Successfully installed Markdown-2.6.9 backports-abc-0.5 certifi-2017.7.27.1 click-6.7 livereload-2.5.1 mkdocs-0.16.3 singledispatch-3.4.0.3 tornado-4.5.1
      Generating markdown files for SQL documentation.
      Generating HTML files for SQL documentation.
      INFO    -  Cleaning site directory
      INFO    -  Building documentation to directory: .../spark/sql/site
      Moving back into docs dir.
      Making directory api/sql
      cp -r ../sql/site/. api/sql
                  Source: .../spark/docs
             Destination: .../spark/docs/_site
            Generating...
                          done.
       Auto-regeneration: disabled. Use --watch to enable.
       ```
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #18984 from HyukjinKwon/sql-doc-mkdocs.
      41e0eb71
    • Cédric Pelvet's avatar
      [MINOR] Correct validateAndTransformSchema in GaussianMixture and AFTSurvivalRegression · 73e04ecc
      Cédric Pelvet authored
      ## What changes were proposed in this pull request?
      
      The call `SchemaUtils.appendColumn(schema, $(predictionCol), IntegerType)` did not modify the variable `schema`; its result was discarded, so only the last append had any effect. A temporary variable is now used to correctly append both columns, predictionCol and probabilityCol.
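      
      A minimal standalone sketch of the bug and the fix, using plain `StructType` operations rather than the ML `SchemaUtils` helper (column names are illustrative):
      
      ```scala
      import org.apache.spark.ml.linalg.SQLDataTypes.VectorType
      import org.apache.spark.sql.types.{IntegerType, StructField, StructType}
      
      // Buggy shape: StructType is immutable, so the first append's result is
      // dropped and only the probability column ends up in the returned schema.
      def transformSchemaBuggy(schema: StructType): StructType = {
        schema.add(StructField("prediction", IntegerType))
        schema.add(StructField("probability", VectorType))
      }
      
      // Fixed shape: thread the result of the first append into the second one.
      def transformSchemaFixed(schema: StructType): StructType = {
        val withPrediction = schema.add(StructField("prediction", IntegerType))
        withPrediction.add(StructField("probability", VectorType))
      }
      ```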
      
      ## How was this patch tested?
      
      Manually.
      
      Please review http://spark.apache.org/contributing.html before opening a pull request.
      
      Author: Cédric Pelvet <cedric.pelvet@gmail.com>
      
      Closes #18980 from sharp-pixel/master.
      73e04ecc
  4. Aug 19, 2017
  5. Aug 18, 2017
    • Andrew Ray's avatar
      [SPARK-21566][SQL][PYTHON] Python method for summary · 10be0184
      Andrew Ray authored
      ## What changes were proposed in this pull request?
      
      Adds the recently introduced `summary` method to the Python DataFrame interface.
      
      ## How was this patch tested?
      
      Additional inline doctests.
      
      Author: Andrew Ray <ray.andrew@gmail.com>
      
      Closes #18762 from aray/summary-py.
      10be0184
    • Andrew Ash's avatar
      [MINOR][TYPO] Fix typos: runnning and Excecutors · a2db5c57
      Andrew Ash authored
      ## What changes were proposed in this pull request?
      
      Fix typos
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Andrew Ash <andrew@andrewash.com>
      
      Closes #18996 from ash211/patch-2.
      a2db5c57
    • Wenchen Fan's avatar
      [SPARK-21743][SQL][FOLLOW-UP] top-most limit should not cause memory leak · 7880909c
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      This is a follow-up of https://github.com/apache/spark/pull/18955 , to fix a bug that we break whole stage codegen for `Limit`.
      
      ## How was this patch tested?
      
      existing tests.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #18993 from cloud-fan/bug.
      7880909c
    • Masha Basmanova's avatar
      [SPARK-21213][SQL] Support collecting partition-level statistics: rowCount and sizeInBytes · 23ea8980
      Masha Basmanova authored
      ## What changes were proposed in this pull request?
      
      Added support for the ANALYZE TABLE [db_name.]tablename PARTITION (partcol1[=val1], partcol2[=val2], ...) COMPUTE STATISTICS [NOSCAN] SQL command to calculate the total number of rows and size in bytes for a subset of partitions. The calculated statistics are stored in the Hive Metastore as user-defined properties attached to the partition objects. The property names are the same as the ones used to store table-level statistics: spark.sql.statistics.totalSize and spark.sql.statistics.numRows.
      
      When the partition specification contains all partition columns with values, the command collects statistics for the single partition that matches the specification. When some partition columns are missing or listed without their values, the command collects statistics for all partitions that match the specified subset of partition column values.
      
      For example, table t has 4 partitions with the following specs:
      
      * Partition1: (ds='2008-04-08', hr=11)
      * Partition2: (ds='2008-04-08', hr=12)
      * Partition3: (ds='2008-04-09', hr=11)
      * Partition4: (ds='2008-04-09', hr=12)
      
      'ANALYZE TABLE t PARTITION (ds='2008-04-09', hr=11)' command will collect statistics only for partition 3.
      
      'ANALYZE TABLE t PARTITION (ds='2008-04-09')' command will collect statistics for partitions 3 and 4.
      
      'ANALYZE TABLE t PARTITION (ds, hr)' command will collect statistics for all four partitions.
      
      When the optional parameter NOSCAN is specified, the command doesn't count the number of rows and only gathers the size in bytes.
      
      The statistics gathered by the ANALYZE TABLE command can be fetched using the DESC EXTENDED [db_name.]tablename PARTITION command.
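      
      A short, hedged sketch of how this could look from the Scala shell (the table and partition values follow the example above; assumes an existing `spark` session with Hive support):
      
      ```scala
      // Collect row count and size for all partitions with ds='2008-04-09'.
      spark.sql("ANALYZE TABLE t PARTITION (ds='2008-04-09') COMPUTE STATISTICS")
      
      // Size-only variant, skipping the row count.
      spark.sql("ANALYZE TABLE t PARTITION (ds='2008-04-09', hr=11) COMPUTE STATISTICS NOSCAN")
      
      // Inspect the stored partition-level statistics.
      spark.sql("DESC EXTENDED t PARTITION (ds='2008-04-09', hr=11)").show(100, false)
      ```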
      
      ## How was this patch tested?
      
      Added tests.
      
      Author: Masha Basmanova <mbasmanova@fb.com>
      
      Closes #18421 from mbasmanova/mbasmanova-analyze-partition.
      23ea8980
    • Reynold Xin's avatar
      [SPARK-21778][SQL] Simpler Dataset.sample API in Scala / Java · 07a2b873
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      Dataset.sample requires a boolean flag withReplacement as the first argument. However, most of the time users simply want to sample some records without replacement. This ticket introduces a new sample function that simply takes in the fraction and seed.
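      
      A minimal before/after sketch of the API shape being described (assumes an existing SparkSession `spark`; fraction and seed values are arbitrary):
      
      ```scala
      val df = spark.range(0, 1000).toDF("id")
      
      // Existing API: the withReplacement flag must be spelled out.
      val a = df.sample(withReplacement = false, fraction = 0.1, seed = 42)
      
      // New, simpler overloads introduced by this change.
      val b = df.sample(0.1)
      val c = df.sample(0.1, 42)
      ```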
      
      ## How was this patch tested?
      Tested manually. Not sure yet if we should add a test case for just this wrapper ...
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #18988 from rxin/SPARK-21778.
      07a2b873
    • donnyzone's avatar
      [SPARK-21739][SQL] Cast expression should initialize timezoneId when it is... · 310454be
      donnyzone authored
      [SPARK-21739][SQL] Cast expression should initialize timezoneId when it is called statically to convert something into TimestampType
      
      ## What changes were proposed in this pull request?
      
      https://issues.apache.org/jira/projects/SPARK/issues/SPARK-21739
      
      This issue was introduced by TimeZoneAwareExpression.
      When the **Cast** expression converts something into TimestampType, it should be resolved by setting its `timezoneId`. In general, this is resolved in the LogicalPlan phase.
      
      However, there are still some places that use a Cast expression statically to convert data types without setting the `timezoneId`. In such cases, `NoSuchElementException: None.get` will be thrown for TimestampType.
      
      This PR fixes the issue. We have checked the whole project and found two such usages (i.e., in `TableReader` and `HiveTableScanExec`).
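      
      A hedged illustration of the pattern using Catalyst's expression API (the literal and time-zone values are examples, not taken from the patch):
      
      ```scala
      import org.apache.spark.sql.catalyst.expressions.{Cast, Literal}
      import org.apache.spark.sql.types.TimestampType
      
      // Constructed statically, the cast has timeZoneId = None; evaluating it for
      // TimestampType fails with NoSuchElementException: None.get.
      val unresolved = Cast(Literal("2017-08-18 00:00:00"), TimestampType)
      
      // Supplying a time zone at construction time avoids the failure.
      val resolved = Cast(Literal("2017-08-18 00:00:00"), TimestampType, Option("GMT"))
      resolved.eval()
      ```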
      
      ## How was this patch tested?
      
      unit test
      
      Author: donnyzone <wellfengzhu@gmail.com>
      
      Closes #18960 from DonnyZone/spark-21739.
      310454be
  6. Aug 17, 2017
    • gatorsmile's avatar
      [SPARK-21767][TEST][SQL] Add Decimal Test For Avro in VersionSuite · 2caaed97
      gatorsmile authored
      ## What changes were proposed in this pull request?
      Decimal is a logical type in Avro. We need to ensure that support for Hive's Avro SerDe works well in Spark.
      
      ## How was this patch tested?
      N/A
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #18977 from gatorsmile/addAvroTest.
      2caaed97
    • Jen-Ming Chung's avatar
      [SPARK-21677][SQL] json_tuple throws NullPointerException when column is null as string type · 7ab95188
      Jen-Ming Chung authored
      ## What changes were proposed in this pull request?
      ``` scala
      scala> Seq(("""{"Hyukjin": 224, "John": 1225}""")).toDS.selectExpr("json_tuple(value, trim(null))").show()
      ...
      java.lang.NullPointerException
      	at ...
      ```
      
      Currently a `null` field name throws a NullPointerException. Since a null field name can't match any field name in the JSON, we simply output null as its column value. This PR achieves that by returning a very unlikely column name, `__NullFieldName`, when evaluating the field names.
      
      ## How was this patch tested?
      Added unit test.
      
      Author: Jen-Ming Chung <jenmingisme@gmail.com>
      
      Closes #18930 from jmchung/SPARK-21677.
      7ab95188
    • ArtRand's avatar
      [SPARK-16742] Mesos Kerberos Support · bfdc361e
      ArtRand authored
      ## What changes were proposed in this pull request?
      
      Add Kerberos support to Mesos. This includes kinit and --keytab support, but does not include delegation token renewal.
      
      ## How was this patch tested?
      
      Manually against a Secure DC/OS Apache HDFS cluster.
      
      Author: ArtRand <arand@soe.ucsc.edu>
      Author: Michael Gummelt <mgummelt@mesosphere.io>
      
      Closes #18519 from mgummelt/SPARK-16742-kerberos.
      bfdc361e
    • Takeshi Yamamuro's avatar
      [SPARK-18394][SQL] Make an AttributeSet.toSeq output order consistent · 6aad02d0
      Takeshi Yamamuro authored
      ## What changes were proposed in this pull request?
      This PR sorts the output attributes by their name and exprId in `AttributeSet.toSeq` to make the order consistent. If the order differs, Spark may generate different code and then miss the cache in `CodeGenerator`; e.g., `GenerateColumnAccessor` generates code that depends on the input attribute order.
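      
      A minimal sketch of the deterministic-ordering idea (simplified; not the exact `AttributeSet` internals):
      
      ```scala
      import org.apache.spark.sql.catalyst.expressions.AttributeReference
      import org.apache.spark.sql.types.IntegerType
      
      // Two attributes that might come out of a set in any order.
      val attrs = Seq(
        AttributeReference("b", IntegerType)(),
        AttributeReference("a", IntegerType)())
      
      // Sorting on a stable key such as (name, exprId) makes repeated planning of
      // the same query produce the same attribute sequence, and hence the same
      // generated code, keeping the CodeGenerator cache effective.
      val ordered = attrs.sortBy(a => (a.name, a.exprId.id))
      ```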
      
      ## How was this patch tested?
      Added tests in `AttributeSetSuite` and manually checked if the cache worked well in the given query of the JIRA.
      
      Author: Takeshi Yamamuro <yamamuro@apache.org>
      
      Closes #18959 from maropu/SPARK-18394.
      6aad02d0
    • gatorsmile's avatar
      [SQL][MINOR][TEST] Set spark.unsafe.exceptionOnMemoryLeak to true · ae9e4247
      gatorsmile authored
      ## What changes were proposed in this pull request?
      When running tests from IntelliJ, we are unable to capture the exception raised by memory leak detection.
      > org.apache.spark.executor.Executor: Managed memory leak detected
      
      This change explicitly sets `spark.unsafe.exceptionOnMemoryLeak` in the SparkConf when building the SparkSession, instead of reading it from system properties.
      
      ## How was this patch tested?
      N/A
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #18967 from gatorsmile/setExceptionOnMemoryLeak.
      ae9e4247
    • Kent Yao's avatar
      [SPARK-21428] Turn IsolatedClientLoader off while using builtin Hive jars for... · b83b502c
      Kent Yao authored
      [SPARK-21428] Turn IsolatedClientLoader off while using builtin Hive jars for reusing CliSessionState
      
      ## What changes were proposed in this pull request?
      
      Set isolated to false when using the builtin Hive jars and `SessionState.get` returns a `CliSessionState` instance.
      
      ## How was this patch tested?
      
      1. Unit tests
      2. Manually verified: the `hive.exec.scratchdir` directory was created only once because the CliSessionState is reused
      ```java
      ➜  spark git:(SPARK-21428) ✗ bin/spark-sql --conf spark.sql.hive.metastore.jars=builtin
      
      log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
      log4j:WARN Please initialize the log4j system properly.
      log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
      Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
      17/07/16 23:59:27 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
      17/07/16 23:59:27 INFO HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
      17/07/16 23:59:27 INFO ObjectStore: ObjectStore, initialize called
      17/07/16 23:59:28 INFO Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
      17/07/16 23:59:28 INFO Persistence: Property datanucleus.cache.level2 unknown - will be ignored
      17/07/16 23:59:29 INFO ObjectStore: Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
      17/07/16 23:59:30 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
      17/07/16 23:59:30 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
      17/07/16 23:59:31 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
      17/07/16 23:59:31 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
      17/07/16 23:59:31 INFO MetaStoreDirectSql: Using direct SQL, underlying DB is DERBY
      17/07/16 23:59:31 INFO ObjectStore: Initialized ObjectStore
      17/07/16 23:59:31 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
      17/07/16 23:59:31 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
      17/07/16 23:59:32 INFO HiveMetaStore: Added admin role in metastore
      17/07/16 23:59:32 INFO HiveMetaStore: Added public role in metastore
      17/07/16 23:59:32 INFO HiveMetaStore: No user is added in admin role, since config is empty
      17/07/16 23:59:32 INFO HiveMetaStore: 0: get_all_databases
      17/07/16 23:59:32 INFO audit: ugi=Kent	ip=unknown-ip-addr	cmd=get_all_databases
      17/07/16 23:59:32 INFO HiveMetaStore: 0: get_functions: db=default pat=*
      17/07/16 23:59:32 INFO audit: ugi=Kent	ip=unknown-ip-addr	cmd=get_functions: db=default pat=*
      17/07/16 23:59:32 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MResourceUri" is tagged as "embedded-only" so does not have its own datastore table.
      17/07/16 23:59:32 INFO SessionState: Created local directory: /var/folders/k2/04p4k4ws73l6711h_mz2_tq00000gn/T/beea7261-221a-4711-89e8-8b12a9d37370_resources
      17/07/16 23:59:32 INFO SessionState: Created HDFS directory: /tmp/hive/Kent/beea7261-221a-4711-89e8-8b12a9d37370
      17/07/16 23:59:32 INFO SessionState: Created local directory: /var/folders/k2/04p4k4ws73l6711h_mz2_tq00000gn/T/Kent/beea7261-221a-4711-89e8-8b12a9d37370
      17/07/16 23:59:32 INFO SessionState: Created HDFS directory: /tmp/hive/Kent/beea7261-221a-4711-89e8-8b12a9d37370/_tmp_space.db
      17/07/16 23:59:32 INFO SparkContext: Running Spark version 2.3.0-SNAPSHOT
      17/07/16 23:59:32 INFO SparkContext: Submitted application: SparkSQL::10.0.0.8
      17/07/16 23:59:32 INFO SecurityManager: Changing view acls to: Kent
      17/07/16 23:59:32 INFO SecurityManager: Changing modify acls to: Kent
      17/07/16 23:59:32 INFO SecurityManager: Changing view acls groups to:
      17/07/16 23:59:32 INFO SecurityManager: Changing modify acls groups to:
      17/07/16 23:59:32 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(Kent); groups with view permissions: Set(); users  with modify permissions: Set(Kent); groups with modify permissions: Set()
      17/07/16 23:59:33 INFO Utils: Successfully started service 'sparkDriver' on port 51889.
      17/07/16 23:59:33 INFO SparkEnv: Registering MapOutputTracker
      17/07/16 23:59:33 INFO SparkEnv: Registering BlockManagerMaster
      17/07/16 23:59:33 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
      17/07/16 23:59:33 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
      17/07/16 23:59:33 INFO DiskBlockManager: Created local directory at /private/var/folders/k2/04p4k4ws73l6711h_mz2_tq00000gn/T/blockmgr-9cfae28a-01e9-4c73-a1f1-f76fa52fc7a5
      17/07/16 23:59:33 INFO MemoryStore: MemoryStore started with capacity 366.3 MB
      17/07/16 23:59:33 INFO SparkEnv: Registering OutputCommitCoordinator
      17/07/16 23:59:33 INFO Utils: Successfully started service 'SparkUI' on port 4040.
      17/07/16 23:59:33 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://10.0.0.8:4040
      17/07/16 23:59:33 INFO Executor: Starting executor ID driver on host localhost
      17/07/16 23:59:33 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 51890.
      17/07/16 23:59:33 INFO NettyBlockTransferService: Server created on 10.0.0.8:51890
      17/07/16 23:59:33 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
      17/07/16 23:59:33 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 10.0.0.8, 51890, None)
      17/07/16 23:59:33 INFO BlockManagerMasterEndpoint: Registering block manager 10.0.0.8:51890 with 366.3 MB RAM, BlockManagerId(driver, 10.0.0.8, 51890, None)
      17/07/16 23:59:33 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 10.0.0.8, 51890, None)
      17/07/16 23:59:33 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 10.0.0.8, 51890, None)
      17/07/16 23:59:34 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/Users/Kent/Documents/spark/spark-warehouse').
      17/07/16 23:59:34 INFO SharedState: Warehouse path is 'file:/Users/Kent/Documents/spark/spark-warehouse'.
      17/07/16 23:59:34 INFO HiveUtils: Initializing HiveMetastoreConnection version 1.2.1 using Spark classes.
      17/07/16 23:59:34 INFO HiveClientImpl: Warehouse location for Hive client (version 1.2.2) is /user/hive/warehouse
      17/07/16 23:59:34 INFO HiveMetaStore: 0: get_database: default
      17/07/16 23:59:34 INFO audit: ugi=Kent	ip=unknown-ip-addr	cmd=get_database: default
      17/07/16 23:59:34 INFO HiveClientImpl: Warehouse location for Hive client (version 1.2.2) is /user/hive/warehouse
      17/07/16 23:59:34 INFO HiveMetaStore: 0: get_database: global_temp
      17/07/16 23:59:34 INFO audit: ugi=Kent	ip=unknown-ip-addr	cmd=get_database: global_temp
      17/07/16 23:59:34 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
      17/07/16 23:59:34 INFO HiveClientImpl: Warehouse location for Hive client (version 1.2.2) is /user/hive/warehouse
      17/07/16 23:59:34 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
      spark-sql>
      
      ```
      cc cloud-fan gatorsmile
      
      Author: Kent Yao <yaooqinn@hotmail.com>
      Author: hzyaoqin <hzyaoqin@corp.netease.com>
      
      Closes #18648 from yaooqinn/SPARK-21428.
      b83b502c
    • Hideaki Tanaka's avatar
      [SPARK-21642][CORE] Use FQDN for DRIVER_HOST_ADDRESS instead of ip address · d695a528
      Hideaki Tanaka authored
      ## What changes were proposed in this pull request?
      
      The patch lets the Spark web UI use the FQDN as its hostname instead of the IP address.
      
      In the current implementation, the IP address of the driver host is set as DRIVER_HOST_ADDRESS. This becomes a problem when we enable SSL using the "spark.ssl.enabled", "spark.ssl.trustStore" and "spark.ssl.keyStore" properties. When these properties are configured, the Spark web UI is launched with SSL enabled and the HTTPS server uses the custom SSL certificate configured in these properties.
      In this case, the client gets a javax.net.ssl.SSLPeerUnverifiedException when it accesses the Spark web UI, because it fails to verify the SSL certificate (the Common Name of the SSL cert does not match DRIVER_HOST_ADDRESS).
      
      To avoid the exception, we should use the FQDN of the driver host for DRIVER_HOST_ADDRESS.
      
      Error message that client gets when the client accesses spark web ui:
      javax.net.ssl.SSLPeerUnverifiedException: Certificate for <10.102.138.239> doesn't match any of the subject alternative names: []
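      
      A tiny standalone sketch of the distinction (output is machine-dependent; this is not the patch itself, which changes how DRIVER_HOST_ADDRESS is derived):
      
      ```scala
      import java.net.InetAddress
      
      val addr = InetAddress.getLocalHost
      // IP address: what the certificate's Common Name usually does NOT contain.
      println(addr.getHostAddress)
      // Fully qualified domain name: what the certificate is typically issued for.
      println(addr.getCanonicalHostName)
      ```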
      
      ## How was this patch tested?
      manual tests
      
      Author: Hideaki Tanaka <tanakah@amazon.com>
      
      Closes #18846 from thideeeee/SPARK-21642.
      d695a528
    • Wenchen Fan's avatar
      [SPARK-21743][SQL] top-most limit should not cause memory leak · a45133b8
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      For top-most limit, we will use a special operator to execute it: `CollectLimitExec`.
      
      `CollectLimitExec` will retrieve `n`(which is the limit) rows from each partition of the child plan output, see https://github.com/apache/spark/blob/v2.2.0/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala#L311. It's very likely that we don't exhaust the child plan output.
      
      This is fine when whole-stage-codegen is off, as child plan will release the resource via task completion listener. However, when whole-stage codegen is on, the resource can only be released if all output is consumed.
      
      To fix this memory leak, one simple approach is: when `CollectLimitExec` retrieves `n` rows from the child plan output, the child plan output should only have `n` rows, so that the output is exhausted and the resource is released. This can be done by wrapping the child plan with `LocalLimit`.
      
      ## How was this patch tested?
      
      a regression test
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #18955 from cloud-fan/leak.
      a45133b8
  7. Aug 16, 2017
    • Eyal Farago's avatar
      [SPARK-3151][BLOCK MANAGER] DiskStore.getBytes fails for files larger than 2GB · b8ffb510
      Eyal Farago authored
      ## What changes were proposed in this pull request?
      Introduced `DiskBlockData`, a new implementation of `BlockData` representing a whole file.
      This is somewhat related to [SPARK-6236](https://issues.apache.org/jira/browse/SPARK-6236) as well.
      
      This class follows the implementation of `EncryptedBlockData`, just without the encryption; hence:
      * `toInputStream` is implemented using a `FileInputStream` (todo: encrypted version actually uses `Channels.newInputStream`, not sure if it's the right choice for this)
      * `toNetty` is implemented in terms of `io.netty.channel.DefaultFileRegion`
      * `toByteBuffer` fails for files larger than 2GB (same behavior of the original code, just postponed a bit), it also respects the same configuration keys defined by the original code to choose between memory mapping and simple file read.
      
      ## How was this patch tested?
      added test to DiskStoreSuite and MemoryManagerSuite
      
      Author: Eyal Farago <eyal@nrgene.com>
      
      Closes #18855 from eyalfa/SPARK-3151.
      b8ffb510
    • Peng Meng's avatar
      [SPARK-21680][ML][MLLIB] optimize Vector compress · a0345cbe
      Peng Meng authored
      ## What changes were proposed in this pull request?
      
      When using `Vector.compressed` to convert a Vector to a SparseVector, the performance is very low compared with `Vector.toSparse`.
      This is because `Vector.compressed` has to scan the values three times, whereas `Vector.toSparse` only needs two passes.
      When the vector is long, there is a significant performance difference between the two methods.
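      
      For context, a small usage sketch of the two entry points being compared (both produce an equivalent sparse vector here; the PR only changes how many internal passes `compressed` needs):
      
      ```scala
      import org.apache.spark.ml.linalg.Vectors
      
      // A mostly-zero vector, so the compressed representation is sparse.
      val v = Vectors.dense(Array.fill(1000)(0.0) ++ Array(1.0, 2.0))
      
      val bySizeHeuristic = v.compressed  // picks dense vs. sparse by estimated size
      val explicitSparse  = v.toSparse    // always converts to sparse
      ```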
      
      ## How was this patch tested?
      
      The existing UT
      
      Author: Peng Meng <peng.meng@intel.com>
      
      Closes #18899 from mpjlu/optVectorCompress.
      a0345cbe
    • Marco Gaido's avatar
      [SPARK-21738] Thriftserver doesn't cancel jobs when session is closed · 7add4e98
      Marco Gaido authored
      ## What changes were proposed in this pull request?
      
      When a session is closed, the Thriftserver doesn't cancel the jobs which may still be running. This is a huge waste of resources.
      This PR addresses the problem by canceling the pending jobs when a session is closed.
      
      ## How was this patch tested?
      
      The patch was tested manually.
      
      Author: Marco Gaido <mgaido@hortonworks.com>
      
      Closes #18951 from mgaido91/SPARK-21738.
      7add4e98
    • 10129659's avatar
      [SPARK-21603][SQL] The wholestage codegen will be much slower then that is... · 1cce1a3b
      10129659 authored
      [SPARK-21603][SQL] The wholestage codegen will be much slower then that is closed when the function is too long
      
      ## What changes were proposed in this pull request?
      Disable whole-stage codegen when the generated function is longer than the maximum length set by the
      spark.sql.codegen.MaxFunctionLength parameter, because when the function is too long it will not get JIT optimized.
      A benchmark shows roughly a 10x slowdown when the generated function is too long:
      
      ```scala
      ignore("max function length of wholestagecodegen") {
          val N = 20 << 15
      
          val benchmark = new Benchmark("max function length of wholestagecodegen", N)
          def f(): Unit = sparkSession.range(N)
            .selectExpr(
              "id",
              "(id & 1023) as k1",
              "cast(id & 1023 as double) as k2",
              "cast(id & 1023 as int) as k3",
              "case when id > 100 and id <= 200 then 1 else 0 end as v1",
              "case when id > 200 and id <= 300 then 1 else 0 end as v2",
              "case when id > 300 and id <= 400 then 1 else 0 end as v3",
              "case when id > 400 and id <= 500 then 1 else 0 end as v4",
              "case when id > 500 and id <= 600 then 1 else 0 end as v5",
              "case when id > 600 and id <= 700 then 1 else 0 end as v6",
              "case when id > 700 and id <= 800 then 1 else 0 end as v7",
              "case when id > 800 and id <= 900 then 1 else 0 end as v8",
              "case when id > 900 and id <= 1000 then 1 else 0 end as v9",
              "case when id > 1000 and id <= 1100 then 1 else 0 end as v10",
              "case when id > 1100 and id <= 1200 then 1 else 0 end as v11",
              "case when id > 1200 and id <= 1300 then 1 else 0 end as v12",
              "case when id > 1300 and id <= 1400 then 1 else 0 end as v13",
              "case when id > 1400 and id <= 1500 then 1 else 0 end as v14",
              "case when id > 1500 and id <= 1600 then 1 else 0 end as v15",
              "case when id > 1600 and id <= 1700 then 1 else 0 end as v16",
              "case when id > 1700 and id <= 1800 then 1 else 0 end as v17",
              "case when id > 1800 and id <= 1900 then 1 else 0 end as v18")
            .groupBy("k1", "k2", "k3")
            .sum()
            .collect()
      
          benchmark.addCase(s"codegen = F") { iter =>
            sparkSession.conf.set("spark.sql.codegen.wholeStage", "false")
            f()
          }
      
          benchmark.addCase(s"codegen = T") { iter =>
            sparkSession.conf.set("spark.sql.codegen.wholeStage", "true")
            sparkSession.conf.set("spark.sql.codegen.MaxFunctionLength", "10000")
            f()
          }
      
          benchmark.run()
      
          /*
          Java HotSpot(TM) 64-Bit Server VM 1.8.0_111-b14 on Windows 7 6.1
          Intel64 Family 6 Model 58 Stepping 9, GenuineIntel
          max function length of wholestagecodegen: Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
          ------------------------------------------------------------------------------------------------
          codegen = F                                    443 /  507          1.5         676.0       1.0X
          codegen = T                                   3279 / 3283          0.2        5002.6       0.1X
           */
        }
      ```
      
      ## How was this patch tested?
      Run the unit test
      
      Author: 10129659 <chen.yanshan@zte.com.cn>
      
      Closes #18810 from eatoncys/codegen.
      1cce1a3b
    • John Lee's avatar
      [SPARK-21656][CORE] spark dynamic allocation should not idle timeout executors... · adf005da
      John Lee authored
      [SPARK-21656][CORE] spark dynamic allocation should not idle timeout executors when tasks still to run
      
      ## What changes were proposed in this pull request?
      
      Right now Spark lets go of executors when they have been idle for 60s (or a configurable time). I have seen Spark let them go when they were idle but still really needed. I have seen this issue when the scheduler was waiting for node locality, which takes longer than the default idle timeout. In these jobs the number of executors drops very low (fewer than 10) while there are still around 80,000 tasks to run.
      We should consider not allowing executors to idle timeout if they are still needed according to the number of tasks remaining.
      
      ## How was this patch tested?
      
      Tested by manually adding executors to `executorsIdsToBeRemoved` list and seeing if those executors were removed when there are a lot of tasks and a high `numExecutorsTarget` value.
      
      Code used
      
      In  `ExecutorAllocationManager.start()`
      
      ```
          start_time = clock.getTimeMillis()
      ```
      
      In `ExecutorAllocationManager.schedule()`
      ```
          val executorIdsToBeRemoved = ArrayBuffer[String]()
          if ( now > start_time + 1000 * 60 * 2) {
            logInfo("--- REMOVING 1/2 of the EXECUTORS ---")
            start_time +=  1000 * 60 * 100
            var counter = 0
            for (x <- executorIds) {
              counter += 1
              if (counter == 2) {
                counter = 0
                executorIdsToBeRemoved += x
              }
            }
          }
      ```
      
      Author: John Lee <jlee2@yahoo-inc.com>
      
      Closes #18874 from yoonlee95/SPARK-21656.
      adf005da
    • Nick Pentreath's avatar
      [SPARK-13969][ML] Add FeatureHasher transformer · 0bb8d1f3
      Nick Pentreath authored
      This PR adds a `FeatureHasher` transformer, modeled on [scikit-learn](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.FeatureHasher.html) and [Vowpal wabbit](https://github.com/JohnLangford/vowpal_wabbit/wiki/Feature-Hashing-and-Extraction).
      
      The transformer operates on multiple input columns in one pass. Current behavior is:
      * for numerical columns, the values are assumed to be real values and the feature index is `hash(columnName)` while feature value is `feature_value`
      * for string columns, the values are assumed to be categorical and the feature index is `hash(column_name=feature_value)`, while feature value is `1.0`
      * For hash collisions, feature values will be summed
      * `null` (missing) values are ignored
      
      The following dataframe illustrates the basic semantics:
      ```
      +---+------+-----+---------+------+-----------------------------------------+
      |int|double|float|stringNum|string|features                                 |
      +---+------+-----+---------+------+-----------------------------------------+
      |3  |4.0   |5.0  |1        |foo   |(16,[0,8,11,12,15],[5.0,3.0,1.0,4.0,1.0])|
      |6  |7.0   |8.0  |2        |bar   |(16,[0,8,11,12,15],[8.0,6.0,1.0,7.0,1.0])|
      +---+------+-----+---------+------+-----------------------------------------+
      ```
      
      ## How was this patch tested?
      
      New unit tests and manual experiments.
      
      Author: Nick Pentreath <nickp@za.ibm.com>
      
      Closes #18513 from MLnick/FeatureHasher.
      0bb8d1f3
    • Jan Vrsovsky's avatar
      [SPARK-21723][ML] Fix writing LibSVM (key not found: numFeatures) · 8321c141
      Jan Vrsovsky authored
      ## What changes were proposed in this pull request?
      
      Check the option "numFeatures" only when reading LibSVM, not when writing. When writing, Spark was raising an exception. After the change it will ignore the option completely. liancheng HyukjinKwon
      
      (Maybe the usage should be forbidden when writing, in a major version change?).
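      
      A hedged sketch of the round trip being tested (assumes an existing SparkSession `spark`; the path and feature count are placeholders):
      
      ```scala
      // Reading honors the numFeatures option...
      val data = spark.read.format("libsvm")
        .option("numFeatures", "780")
        .load("data/mllib/sample_libsvm_data.txt")
      
      // ...while writing no longer requires (and now simply ignores) it.
      data.write.format("libsvm").save("/tmp/libsvm-out")
      ```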
      
      ## How was this patch tested?
      
      Manually tested that loading and writing LibSVM files works fine, both with and without the numFeatures option.
      
      Please review http://spark.apache.org/contributing.html before opening a pull request.
      
      Author: Jan Vrsovsky <jan.vrsovsky@firma.seznam.cz>
      
      Closes #18872 from ProtD/master.
      8321c141
    • Dongjoon Hyun's avatar
      [SPARK-21422][BUILD] Depend on Apache ORC 1.4.0 · 8c54f1eb
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      As with Parquet, this PR aims to depend on the latest Apache ORC 1.4 for Apache Spark 2.3. There are key benefits to Apache ORC 1.4:
      
      - Stability: Apache ORC 1.4.0 has many fixes and we can depend on the ORC community more.
      - Maintainability: Reduces the Hive dependency and lets us remove old legacy code later.
      
      Later, we can get the following two key benefits by adding the new ORCFileFormat in SPARK-20728 (#17980), too.
      - Usability: Users can use ORC data sources without the hive module, i.e., without -Phive.
      - Speed: Use both Spark ColumnarBatch and ORC RowBatch together, which will be faster than the current implementation in Spark.
      
      ## How was this patch tested?
      
      Pass the jenkins.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #18640 from dongjoon-hyun/SPARK-21422.
      8c54f1eb
  8. Aug 15, 2017
    • WeichenXu's avatar
      [SPARK-19634][ML] Multivariate summarizer - dataframes API · 07549b20
      WeichenXu authored
      ## What changes were proposed in this pull request?
      
      This patch adds the DataFrames API to the multivariate summarizer (mean, variance, etc.). In addition to all the features of MultivariateOnlineSummarizer, it also allows the user to select a subset of the metrics.
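      
      A brief usage sketch of the new DataFrame summarizer (assumes an existing SparkSession `spark`; metric names and data are illustrative):
      
      ```scala
      import org.apache.spark.ml.linalg.Vectors
      import org.apache.spark.ml.stat.Summarizer
      import org.apache.spark.sql.functions.col
      
      val df = spark.createDataFrame(Seq(
        Tuple1(Vectors.dense(1.0, 2.0)),
        Tuple1(Vectors.dense(3.0, 4.0))
      )).toDF("features")
      
      // Compute only the requested metrics in a single pass.
      val stats = df.select(
        Summarizer.metrics("mean", "variance").summary(col("features")).alias("s"))
      stats.select("s.mean", "s.variance").show(false)
      ```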
      
      ## How was this patch tested?
      
      Testcases added.
      
      ## Performance
      Resolves several performance issues in #17419; further optimization is pending on the SQL team's work. One of the SQL-layer performance issues related to this feature has been resolved in #18712, thanks to liancheng and cloud-fan.
      
      ### Performance data
      
      (Tested on my laptop, using 2 partitions; trial runs = 20, warm-up runs = 10.)
      
      The unit of test results is records/milliseconds (higher is better)
      
      Vector size/records number | 1/10000000 | 10/1000000 | 100/1000000 | 1000/100000 | 10000/10000
      ----|------|----|---|----|----
      Dataframe | 15149  | 7441 | 2118 | 224 | 21
      RDD from Dataframe | 4992  | 4440 | 2328 | 320 | 33
      raw RDD | 53931  | 20683 | 3966 | 528 | 53
      
      Author: WeichenXu <WeichenXu123@outlook.com>
      
      Closes #18798 from WeichenXu123/SPARK-19634-dataframe-summarizer.
      07549b20
    • Nicholas Chammas's avatar
      [SPARK-21712][PYSPARK] Clarify type error for Column.substr() · 96608310
      Nicholas Chammas authored
      Proposed changes:
      * Clarify the type error that `Column.substr()` gives.
      
      Test plan:
      * Tested this manually.
      * Test code:
          ```python
          from pyspark.sql.functions import col, lit
          spark.createDataFrame([['nick']], schema=['name']).select(col('name').substr(0, lit(1)))
          ```
      * Before:
          ```
          TypeError: Can not mix the type
          ```
      * After:
          ```
          TypeError: startPos and length must be the same type. Got <class 'int'> and
          <class 'pyspark.sql.column.Column'>, respectively.
          ```
      
      Author: Nicholas Chammas <nicholas.chammas@gmail.com>
      
      Closes #18926 from nchammas/SPARK-21712-substr-type-error.
      96608310
    • Xingbo Jiang's avatar
      [MINOR] Fix a typo in the method name `UserDefinedFunction.asNonNullabe` · 42b9eda8
      Xingbo Jiang authored
      ## What changes were proposed in this pull request?
      
      The method name `asNonNullabe` should be `asNonNullable`.
      
      ## How was this patch tested?
      
      N/A
      
      Author: Xingbo Jiang <xingbo.jiang@databricks.com>
      
      Closes #18952 from jiangxb1987/typo.
      42b9eda8
    • Marcelo Vanzin's avatar
      [SPARK-21731][BUILD] Upgrade scalastyle to 0.9. · 3f958a99
      Marcelo Vanzin authored
      This version fixes a few issues in the import order checker; it provides
      better error messages, and detects more improper ordering (thus the need
      to change a lot of files in this patch). The main fix is that it correctly
      complains about the order of packages vs. classes.
      
      As part of the above, I moved some "SparkSession" import in ML examples
      inside the "$example on$" blocks; that didn't seem consistent across
      different source files to start with, and avoids having to add more on/off blocks
      around specific imports.
      
      The new scalastyle also seems to have a better header detector, so a few
      license headers had to be updated to match the expected indentation.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #18943 from vanzin/SPARK-21731.
      3f958a99