  1. Dec 13, 2016
    • [SPARK-17932][SQL][FOLLOWUP] Change statement `SHOW TABLES EXTENDED` to `SHOW TABLE EXTENDED` · 5572ccf8
      jiangxingbo authored
      ## What changes were proposed in this pull request?
      
      Change the statement `SHOW TABLES [EXTENDED] [(IN|FROM) database_name] [[LIKE] 'identifier_with_wildcards'] [PARTITION(partition_spec)]` to the following statements:
      
      - SHOW TABLES [(IN|FROM) database_name] [[LIKE] 'identifier_with_wildcards']
      - SHOW TABLE EXTENDED [(IN|FROM) database_name] LIKE 'identifier_with_wildcards' [PARTITION(partition_spec)]
      
      After this change, the `SHOW TABLE` and `SHOW TABLES` statements have the same syntax as their Hive counterparts.
      
      ## How was this patch tested?
      Modified the test sql file `show-tables.sql`;
      Modified the test suite `DDLSuite`.
      
      Author: jiangxingbo <jiangxb1987@gmail.com>
      
      Closes #16262 from jiangxb1987/show-table-extended.
    • [SPARK-18835][SQL] Don't expose Guava types in the JavaTypeInference API. · f280ccf4
      Marcelo Vanzin authored
      This avoids issues during maven tests because of shading.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #16260 from vanzin/SPARK-18835.
    • [SPARK-13747][CORE] Fix potential ThreadLocal leaks in RPC when using ForkJoinPool · fb3081d3
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      Some places in SQL may call `RpcEndpointRef.askWithRetry` (e.g., ParquetFileFormat.buildReader -> SparkContext.broadcast -> ... -> BlockManagerMaster.updateBlockInfo -> RpcEndpointRef.askWithRetry), which will finally call `Await.result`. It may cause `java.lang.IllegalArgumentException: spark.sql.execution.id is already set` when running in Scala ForkJoinPool.
      
      This PR includes the following changes to fix this issue:
      
      - Remove `ThreadUtils.awaitResult`
      - Rename `ThreadUtils.awaitResultInForkJoinSafely` to `ThreadUtils.awaitResult`
      - Replace `Await.result` in RpcTimeout with `ThreadUtils.awaitResult`.
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #16230 from zsxwing/fix-SPARK-13747.
    • [SPARK-18675][SQL] CTAS for hive serde table should work for all hive versions · d53f18ca
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      Before Hive 1.1, when inserting into a table, Hive creates the staging directory under a common scratch directory. After the write finishes, Hive simply empties the table directory and moves the staging directory into it.
      
      Starting with Hive 1.1, Hive creates the staging directory under the table directory. When moving the staging directory into the table directory, Hive still empties the table directory but excludes the staging directory.
      
      In `InsertIntoHiveTable`, we simply copied the code from Hive 1.2, which means we always create the staging directory under the table directory, regardless of the Hive version. This causes problems if the Hive version is prior to 1.1, because Hive removes the staging directory when it empties the table directory.
      
      This PR also copies the code from Hive 0.13, so that we have two branches for creating the staging directory. If the Hive version is prior to 1.1, we take the old-style branch (i.e. create the staging directory under a common scratch directory); otherwise, we take the new-style branch (i.e. create the staging directory under the table directory).
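
The two branches described above can be sketched as follows. This is only an illustration of the version check, not the code in `InsertIntoHiveTable`: the function name, the tuple encoding of versions, and the `staging` directory name are all assumptions made for the example.

```python
from pathlib import PurePosixPath

# Hive versions are modeled as (major, minor) tuples; illustration only.
NEW_STYLE_SINCE = (1, 1)


def staging_dir(hive_version, table_dir, scratch_dir):
    """Pick the parent of the staging directory based on the Hive version."""
    if hive_version < NEW_STYLE_SINCE:
        # Old style: Hive empties the whole table directory on load,
        # so the staging directory has to live somewhere else.
        return PurePosixPath(scratch_dir) / "staging"
    # New style: Hive skips the staging directory when emptying the table directory.
    return PurePosixPath(table_dir) / "staging"
```

With these assumptions, `staging_dir((0, 13), "/warehouse/t", "/tmp/scratch")` lands under the scratch directory, while any version from `(1, 1)` onward lands under the table directory.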
      
      ## How was this patch tested?
      
      new test
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #16104 from cloud-fan/hive-0.13.
    • [MINOR][CORE][SQL] Remove explicit RDD and Partition overrides · 096f868b
      Jacek Laskowski authored
      ## What changes were proposed in this pull request?
      
      I **believe** that I _only_ removed duplicated code (that adds nothing but noise). I'm gonna remove the comment after Jenkins has built the changes with no issues and Spark devs have agreed to include the changes.
      
      Remove explicit `RDD` and `Partition` overrides (that turn out to be code duplication)
      
      ## How was this patch tested?
      
      Local build. Awaiting Jenkins.
      
      Author: Jacek Laskowski <jacek@japila.pl>
      
      Closes #16145 from jaceklaskowski/rdd-overrides-removed.
    • [SPARK-18717][SQL] Make code generation for Scala Map work with immutable.Map also · 46d30ac4
      Andrew Ray authored
      ## What changes were proposed in this pull request?
      
      Fixes compile errors in generated code when the user has a case class with a `scala.collection.immutable.Map` instead of a `scala.collection.Map`. Since `ArrayBasedMapData.toScalaMap` returns the immutable version, we can make it work with both.
      
      ## How was this patch tested?
      
      Additional unit tests.
      
      Author: Andrew Ray <ray.andrew@gmail.com>
      
      Closes #16161 from aray/fix-map-codegen.
    • [SPARK-18797][SPARKR] Update spark.logit in sparkr-vignettes · 2aa16d03
      wm624@hotmail.com authored
      ## What changes were proposed in this pull request?
      `spark.logit` was added in 2.1. We need to update sparkr-vignettes to reflect the changes. This is part of the SparkR QA work.
      
      ## How was this patch tested?
      
      Manual build html. Please see attached image for the result.
      ![test](https://cloud.githubusercontent.com/assets/5033592/21032237/01b565fe-bd5d-11e6-8b59-4de4b6ef611d.jpeg)
      
      Author: wm624@hotmail.com <wm624@hotmail.com>
      
      Closes #16222 from wangmiao1981/veg.
    • [SPARK-18796][SS] StreamingQueryManager should not block when starting a query · 417e45c5
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      Major change in this PR:
      - Add `pendingQueryNames` and `pendingQueryIds` to track queries that are going to start but have not yet been put into `activeQueries`, so that we don't need to hold a lock when starting a query.
      
      Minor changes:
      - Fix a potential NPE when the user sets `checkpointLocation` using SQLConf but doesn't specify a query name.
      - Add missing docs in `StreamingQueryListener`
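
The major change can be pictured with the following sketch: a name is reserved in a pending set under a short lock, the slow start runs without the lock, and the query is then published as active. The class and method names are hypothetical and only mirror the idea behind `pendingQueryNames` and `activeQueries`; this is not the `StreamingQueryManager` code.

```python
import threading


class QueryManager:
    """Toy manager that never holds its lock while a query is starting."""

    def __init__(self):
        self._lock = threading.Lock()
        self._active = {}      # name -> started query
        self._pending = set()  # names reserved by queries that are still starting

    def is_reserved(self, name):
        with self._lock:
            return name in self._active or name in self._pending

    def start_query(self, name, start_fn):
        # Short critical section: only check and reserve the name.
        with self._lock:
            if name in self._active or name in self._pending:
                raise ValueError(f"query name {name!r} is already in use")
            self._pending.add(name)
        try:
            # Potentially slow; runs without the lock, so other callers are not blocked.
            query = start_fn()
            with self._lock:
                self._active[name] = query
            return query
        finally:
            # Drop the reservation whether the start succeeded or failed.
            with self._lock:
                self._pending.discard(name)
```

Because the lock is released before `start_fn` runs, `start_fn` itself may call back into the manager without deadlocking, and a concurrent attempt to reuse the same name is still rejected.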
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #16220 from zsxwing/SPARK-18796.
  2. Dec 12, 2016
    • [SPARK-18773][CORE] Make commons-crypto config translation consistent. · bc59951b
      Marcelo Vanzin authored
      This change moves the logic that translates Spark configuration to
      commons-crypto configuration to the network-common module. It also
      extends TransportConf and ConfigProvider to provide the necessary
      interfaces for the translation to work.
      
      As part of the change, I removed SystemPropertyConfigProvider, which
      was mostly used as an "empty config" in unit tests, and adjusted the
      very few tests that required a specific config.
      
      I also changed the config keys for AES encryption to live under the
      "spark.network." namespace, which is more correct than their previous
      names under "spark.authenticate.".
      
      Tested via existing unit test.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #16200 from vanzin/SPARK-18773.
    • [SPARK-18810][SPARKR] SparkR install.spark does not work for RCs, snapshots · 8a51cfdc
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      Support overriding the download URL (including the version directory) with an environment variable, `SPARKR_RELEASE_DOWNLOAD_URL`.
      
      ## How was this patch tested?
      
      unit test, manually testing
      - snapshot build url
        - download when spark jar not cached
        - when spark jar is cached
      - RC build url
        - download when spark jar not cached
        - when spark jar is cached
      - multiple cached spark versions
      - starting with sparkR shell
      
      To use this,
      ```
      SPARKR_RELEASE_DOWNLOAD_URL=http://this_is_the_url_to_spark_release_tgz R
      ```
      then in R,
      ```
      library(SparkR) # or specify lib.loc
      sparkR.session()
      ```
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #16248 from felixcheung/rinstallurl.
    • [SPARK-18681][SQL] Fix filtering to compatible with partition keys of type int · 90abfd15
      Yuming Wang authored
      ## What changes were proposed in this pull request?
      
      Cloudera uses `/var/run/cloudera-scm-agent/process/15000-hive-HIVEMETASTORE/hive-site.xml` as the configuration file for the Hive Metastore Server, where `hive.metastore.try.direct.sql=false`. But Spark doesn't read this configuration file and gets the default value `hive.metastore.try.direct.sql=true`. As mallman said, we should use the `getMetaConf` method to obtain the original configuration from the Hive Metastore Server. I have tested this method a few times, and the return value is always consistent with the Hive Metastore Server.
      
      ## How was this patch tested?
      
      The existing tests.
      
      Author: Yuming Wang <wgyumg@gmail.com>
      
      Closes #16122 from wangyum/SPARK-18681.
    • [SPARK-18752][HIVE] "isSrcLocal" value should be set from user query. · 476b34c2
      Marcelo Vanzin authored
      The value of the "isSrcLocal" parameter passed to Hive's loadTable and
      loadPartition methods needs to be set according to the user query (e.g.
      "LOAD DATA LOCAL"), and not the current code that tries to guess what
      it should be.
      
      For existing versions of Hive the current behavior is probably ok, but
      some recent changes in the Hive code changed the semantics slightly,
      causing code that incorrectly sets "isSrcLocal" to "true" to do the
      wrong thing. It would end up moving the parent directory of the files
      into the final location, instead of the files themselves, resulting
      in a table that cannot be read.
      
      I modified HiveCommandSuite so that existing "LOAD DATA" tests are run
      both in local and non-local mode, since the semantics are slightly different.
      The tests include a few new checks to make sure the semantics follow
      what Hive describes in its documentation.
      
      Tested with existing unit tests and also ran some Hive integration tests
      with a version of Hive containing the changes that surfaced the problem.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #16179 from vanzin/SPARK-18752.
    • [SPARK-16297][SQL] Fix mapping Microsoft SQLServer dialect · bf42c2db
      meknio authored
      Without this fix, an exception is thrown with the following error:
      
        "Cannot specify a column width on data type bit."
      
      The problem stems from the fact that the `java.sql.Types.BIT` type is mapped to BIT[n] when it must be mapped to plain BIT.
      This concerns the Boolean type.
      
      As for the String type with maximum character length, it must be mapped to VARCHAR(MAX) instead of TEXT, which is a deprecated type in SQL Server.
      
      Here is the list of mappings for SQL Server:
      https://msdn.microsoft.com/en-us/library/ms378878(v=sql.110).aspx
      
      Closes #13944 from meknio/master.
    • [SPARK-15844][CORE] HistoryServer doesn't come up if spark.authenticate = true · 586d1982
      Steve Loughran authored
      ## What changes were proposed in this pull request?
      
      During history server startup, the Spark configuration is examined. If `spark.authenticate` is
      set, log at debug and set the value to false, so that `SecurityManager` can be created.
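
As a rough sketch of that startup step, the configuration can be modeled as a plain dictionary. The function name and the choice to return an adjusted copy rather than mutate the input are assumptions for illustration; only the `spark.authenticate` key comes from the description above.

```python
import logging

log = logging.getLogger("history-server-sketch")


def conf_for_security_manager(conf):
    """Return a copy of conf with spark.authenticate forced to false."""
    adjusted = dict(conf)
    if adjusted.get("spark.authenticate", "false").lower() == "true":
        # Logged at debug level, as in the description above.
        log.debug("spark.authenticate is set; disabling it for the history server")
        adjusted["spark.authenticate"] = "false"
    return adjusted
```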
      
      ## How was this patch tested?
      
      A new test in `HistoryServerSuite` sets the `spark.authenticate` property to true, tries to create a security manager via a new package-private method `HistoryServer.createSecurityManager(SparkConf)`. This is the method used in `HistoryServer.main`. All other instantiations of a security manager in `HistoryServerSuite` have been switched to the new method, for consistency with the production code.
      
      Author: Steve Loughran <stevel@apache.org>
      
      Closes #13579 from steveloughran/history/SPARK-15844-security.
    • [DOCS][MINOR] Clarify Where AccumulatorV2s are Displayed · 70ffff21
      Bill Chambers authored
      ## What changes were proposed in this pull request?
      
      This PR clarifies where accumulators will be displayed.
      
      ## How was this patch tested?
      
      No testing.
      
      Author: Bill Chambers <bill@databricks.com>
      Author: anabranch <wac.chambers@gmail.com>
      Author: Bill Chambers <wchambers@ischool.berkeley.edu>
      
      Closes #16180 from anabranch/improve-acc-docs.
    • [SPARK-18790][SS] Keep a general offset history of stream batches · 83a42897
      Tyson Condie authored
      ## What changes were proposed in this pull request?
      
      Instead of only keeping the minimum number of offsets around, we should keep enough information to allow us to roll back n batches and re-execute the stream starting from a given point. In particular, we should create a config in SQLConf, `spark.sql.streaming.retainedBatches`, that defaults to 100, and ensure that we keep enough log files in the following places to roll back the specified number of batches:
      - the offsets that are present in each batch
      - versions of the state store
      - the file lists stored for the FileStreamSource
      - the metadata log stored by the FileStreamSink
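
The retention rule can be illustrated with a toy offset log keyed by batch id. The function name and the exact boundary (keeping the newest `min_batches_to_retain` batches, including the current one) are assumptions for this sketch and may differ from what Spark does.

```python
def purge_old_batches(log, current_batch_id, min_batches_to_retain):
    """Keep only the newest `min_batches_to_retain` batches up to the current one."""
    threshold = current_batch_id - min_batches_to_retain
    # Entries at or below the threshold are no longer needed for a rollback.
    return {batch_id: meta for batch_id, meta in log.items() if batch_id > threshold}
```

For example, with batches 0 through 9 in the log, a current batch id of 9 and a retention of 3, only batches 7, 8 and 9 remain.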
      
      marmbrus zsxwing
      
      ## How was this patch tested?
      
      The following tests were added.
      
      ### StreamExecution offset metadata
      Test added to StreamingQuerySuite that ensures offset metadata is garbage collected according to `minBatchesToRetain`.
      
      ### CompactibleFileStreamLog
      Tests added in CompactibleFileStreamLogSuite to ensure that logs are purged starting before the first compaction file that precedes the current batch id - `minBatchesToRetain`.
      
      Author: Tyson Condie <tcondie@gmail.com>
      
      Closes #16219 from tcondie/offset_hist.
  3. Dec 11, 2016
  4. Dec 10, 2016
    • [SPARK-18815][SQL] Fix NPE when collecting column stats for string/binary column having only null values · a29ee55a
      wangzhenhua authored
      
      ## What changes were proposed in this pull request?
      
      During column stats collection, the average and max length will be null if a column of string/binary type has only null values. To fix this, the default size is used when the avg/max length is null.
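
A minimal sketch of that fallback, assuming a hypothetical `DEFAULT_SIZE` constant and helper name (the real default size depends on the column's data type):

```python
DEFAULT_SIZE = 20  # hypothetical default length, used only for this illustration


def length_stats(values):
    """Return (average length, max length), falling back to a default for all-null columns."""
    lengths = [len(v) for v in values if v is not None]
    if not lengths:
        # No non-null values: avg/max would be null, so use the default instead.
        return float(DEFAULT_SIZE), DEFAULT_SIZE
    return sum(lengths) / len(lengths), max(lengths)
```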
      
      ## How was this patch tested?
      
      Add a test for handling null columns
      
      Author: wangzhenhua <wangzhenhua@huawei.com>
      
      Closes #16243 from wzhfy/nullStats.
    • [SPARK-18803][TESTS] Fix JarEntry-related & path-related test failures and skip some tests by path length limitation on Windows · e094d011
      hyukjinkwon authored
      
      ## What changes were proposed in this pull request?
      
      This PR proposes to fix several tests that fail on Windows, for the reasons described below.
      
      ### Incorrect path handling
      
      - FileSuite
        ```
        [info] - binary file input as byte array *** FAILED *** (500 milliseconds)
        [info]   "file:/C:/projects/spark/target/tmp/spark-e7c3a3b8-0a4b-4a7f-9ebe-7c4883e48624/record-bytestream-00000.bin" did not contain "C:\projects\spark\target\tmp\spark-e7c3a3b8-0a4b-4a7f-9ebe-7c4883e48624\record-bytestream-00000.bin" (FileSuite.scala:258)
        [info]   org.scalatest.exceptions.TestFailedException:
        [info]   at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500)
        ...
        ```
        ```
        [info] - Get input files via old Hadoop API *** FAILED *** (1 second, 94 milliseconds)
        [info]   Set("/C:/projects/spark/target/tmp/spark-cf5b1f8b-c5ed-43e0-8d17-546ebbfa8200/output/part-00000", "/C:/projects/spark/target/tmp/spark-cf5b1f8b-c5ed-43e0-8d17-546ebbfa8200/output/part-00001") did not equal Set("C:\projects\spark\target\tmp\spark-cf5b1f8b-c5ed-43e0-8d17-546ebbfa8200\output/part-00000", "C:\projects\spark\target\tmp\spark-cf5b1f8b-c5ed-43e0-8d17-546ebbfa8200\output/part-00001") (FileSuite.scala:535)
        [info]   org.scalatest.exceptions.TestFailedException:
        [info]   at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500)
        ...
        ```
      
        ```
        [info] - Get input files via new Hadoop API *** FAILED *** (313 milliseconds)
        [info]   Set("/C:/projects/spark/target/tmp/spark-12bc1540-1111-4df6-9c4d-79e0e614407c/output/part-00000", "/C:/projects/spark/target/tmp/spark-12bc1540-1111-4df6-9c4d-79e0e614407c/output/part-00001") did not equal Set("C:\projects\spark\target\tmp\spark-12bc1540-1111-4df6-9c4d-79e0e614407c\output/part-00000", "C:\projects\spark\target\tmp\spark-12bc1540-1111-4df6-9c4d-79e0e614407c\output/part-00001") (FileSuite.scala:549)
        [info]   org.scalatest.exceptions.TestFailedException:
        ...
        ```
      
      - TaskResultGetterSuite
      
        ```
        [info] - handling results larger than max RPC message size *** FAILED *** (1 second, 579 milliseconds)
        [info]   1 did not equal 0 Expect result to be removed from the block manager. (TaskResultGetterSuite.scala:129)
        [info]   org.scalatest.exceptions.TestFailedException:
        [info]   ...
        [info]   Cause: java.net.URISyntaxException: Illegal character in path at index 12: string:///C:\projects\spark\target\tmp\spark-93c485af-68da-440f-a907-aac7acd5fc25\repro\MyException.java
        [info]   at java.net.URI$Parser.fail(URI.java:2848)
        [info]   at java.net.URI$Parser.checkChars(URI.java:3021)
        ...
        ```
        ```
        [info] - failed task deserialized with the correct classloader (SPARK-11195) *** FAILED *** (0 milliseconds)
        [info]   java.lang.IllegalArgumentException: Illegal character in path at index 12: string:///C:\projects\spark\target\tmp\spark-93c485af-68da-440f-a907-aac7acd5fc25\repro\MyException.java
        [info]   at java.net.URI.create(URI.java:852)
        ...
        ```
      
      - SparkSubmitSuite
      
        ```
        [info]   java.lang.IllegalArgumentException: Illegal character in path at index 12: string:///C:\projects\spark\target\tmp\1481210831381-0\870903339\MyLib.java
        [info]   at java.net.URI.create(URI.java:852)
        [info]   at org.apache.spark.TestUtils$.org$apache$spark$TestUtils$$createURI(TestUtils.scala:112)
        ...
        ```
      
      ### Incorrect separator for JarEntry
      
      After the path fix above, `TaskResultGetterSuite` throws another exception, as below:
      
      ```
      [info] - failed task deserialized with the correct classloader (SPARK-11195) *** FAILED *** (907 milliseconds)
      [info]   java.lang.ClassNotFoundException: repro.MyException
      [info]   at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
      ...
      ```
      
      This is because `Paths.get` concatenates the given paths into an OS-specific path (`\` on Windows and `/` on Linux). However, for `JarEntry` we should comply with the ZIP specification, which means the separator should always be `/`.
      
      See `4.4.17 file name: (Variable)` in https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT
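
The same pitfall can be reproduced outside the JVM. The sketch below uses Python's `pathlib` and `zipfile` in place of `Paths.get` and `JarEntry`; the helper names are made up for the example. It builds an entry name with `/` regardless of the OS-style separator.

```python
import io
import zipfile
from pathlib import PureWindowsPath


def jar_entry_name(*parts):
    """Join path parts the way a ZIP/JAR entry name requires: always with '/'."""
    # PureWindowsPath joins with '\\'; as_posix() converts the separators to '/'.
    return PureWindowsPath(*parts).as_posix()


def build_jar(entries):
    """Write {entry name: bytes} into an in-memory archive and return its bytes."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as jar:
        for name, data in entries.items():
            jar.writestr(name, data)
    return buf.getvalue()
```

Joining `"repro"` and `"MyException.class"` as a Windows path gives `repro\MyException.class`, which a class loader would not find inside the archive, whereas `jar_entry_name` yields `repro/MyException.class`.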
      
      ### Long path problem on Windows
      
      Some tests in `ShuffleSuite` via `ShuffleNettySuite` were skipped for the same reason as SPARK-18718.
      
      ## How was this patch tested?
      
      Manually via AppVeyor.
      
      **Before**
      
      - `FileSuite`, `TaskResultGetterSuite`,`SparkSubmitSuite`
        https://ci.appveyor.com/project/spark-test/spark/build/164-tmp-windows-base (please grep each to check each)
      - `ShuffleSuite`
        https://ci.appveyor.com/project/spark-test/spark/build/157-tmp-windows-base
      
      **After**
      
      - `FileSuite`
        https://ci.appveyor.com/project/spark-test/spark/build/166-FileSuite
      - `TaskResultGetterSuite`
        https://ci.appveyor.com/project/spark-test/spark/build/173-TaskResultGetterSuite
      - `SparkSubmitSuite`
        https://ci.appveyor.com/project/spark-test/spark/build/167-SparkSubmitSuite
      - `ShuffleSuite`
        https://ci.appveyor.com/project/spark-test/spark/build/176-ShuffleSuite
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #16234 from HyukjinKwon/test-errors-windows.
    • [SPARK-3359][DOCS] Fix greater-than symbols in Javadoc to allow building with Java 8 · 11432483
      Michal Senkyr authored
      ## What changes were proposed in this pull request?
      
      The API documentation build was failing when using Java 8 due to incorrect character `>` in Javadoc.
      
      Replace `>` with literals in Javadoc to allow the build to pass.
      
      ## How was this patch tested?
      
      Documentation was built and inspected manually to ensure it still displays correctly in the browser
      
      ```
      cd docs && jekyll serve
      ```
      
      Author: Michal Senkyr <mike.senkyr@gmail.com>
      
      Closes #16201 from michalsenkyr/javadoc8-gt-fix.
    • [SPARK-18766][SQL] Push Down Filter Through BatchEvalPython (Python UDF) · 422a45cf
      gatorsmile authored
      ### What changes were proposed in this pull request?
      Currently, when users use a Python UDF in a Filter, BatchEvalPython is always generated below FilterExec. However, not all the predicates need to be evaluated after Python UDF execution. Thus, this PR pushes the deterministic predicates down through `BatchEvalPython`.
      ```Python
      >>> df = spark.createDataFrame([(1, "1"), (2, "2"), (1, "2"), (1, "2")], ["key", "value"])
      >>> from pyspark.sql.functions import udf, col
      >>> from pyspark.sql.types import BooleanType
      >>> my_filter = udf(lambda a: a < 2, BooleanType())
      >>> sel = df.select(col("key"), col("value")).filter((my_filter(col("key"))) & (df.value < "2"))
      >>> sel.explain(True)
      ```
      Before the fix, the plan looks like
      ```
      == Optimized Logical Plan ==
      Filter ((isnotnull(value#1) && <lambda>(key#0L)) && (value#1 < 2))
      +- LogicalRDD [key#0L, value#1]
      
      == Physical Plan ==
      *Project [key#0L, value#1]
      +- *Filter ((isnotnull(value#1) && pythonUDF0#9) && (value#1 < 2))
         +- BatchEvalPython [<lambda>(key#0L)], [key#0L, value#1, pythonUDF0#9]
            +- Scan ExistingRDD[key#0L,value#1]
      ```
      
      After the fix, the plan looks like
      ```
      == Optimized Logical Plan ==
      Filter ((isnotnull(value#1) && <lambda>(key#0L)) && (value#1 < 2))
      +- LogicalRDD [key#0L, value#1]
      
      == Physical Plan ==
      *Project [key#0L, value#1]
      +- *Filter pythonUDF0#9: boolean
         +- BatchEvalPython [<lambda>(key#0L)], [key#0L, value#1, pythonUDF0#9]
            +- *Filter (isnotnull(value#1) && (value#1 < 2))
               +- Scan ExistingRDD[key#0L,value#1]
      ```
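
The split visible in the two plans above can be sketched in plain Python. The `Predicate` class and `split_predicates` function are hypothetical and much simpler than the planner: they only separate conjuncts that are deterministic and do not read a UDF output column (safe to evaluate below `BatchEvalPython`) from the rest.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Predicate:
    text: str
    references: frozenset        # column names the predicate reads
    deterministic: bool = True


def split_predicates(conjuncts, udf_outputs):
    """Return (pushed_down, kept_above) for a list of AND-ed predicates."""
    pushed, kept = [], []
    for pred in conjuncts:
        if pred.deterministic and not (pred.references & udf_outputs):
            pushed.append(pred)   # does not need the UDF result
        else:
            kept.append(pred)     # must run after the UDF has been evaluated
    return pushed, kept
```

Applied to the three conjuncts of the example filter, `isnotnull(value)` and `value < 2` are pushed down, while the predicate on `pythonUDF0` stays above the Python evaluation.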
      
      ### How was this patch tested?
      Added unit test cases for `BatchEvalPythonExec` and an end-to-end test case in the Python test suite.
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #16193 from gatorsmile/pythonUDFPredicatePushDown.
    • [SPARK-18606][HISTORYSERVER] remove useless elements while searching · 3a3e65ad
      WangTaoTheTonic authored
      ## What changes were proposed in this pull request?
      
      When we search applications in HistoryServer, the search covers all content inside the <td> tags, including useless elements like "<span title...", "a href", which makes the results confusing.
      We should remove those to make the results clear.
      
      ## How was this patch tested?
      
      manual tests.
      
      Before:
      ![before](https://cloud.githubusercontent.com/assets/5276001/20662840/28bcc874-b590-11e6-9115-12fb64e49898.jpg)
      
      After:
      ![after](https://cloud.githubusercontent.com/assets/5276001/20662844/2f717af2-b590-11e6-97dc-a48b08a54247.jpg)
      
      Author: WangTaoTheTonic <wangtao111@huawei.com>
      
      Closes #16031 from WangTaoTheTonic/span.
    • [MINOR][DOCS] Remove Apache Spark Wiki address · f3a3fed7
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      According to the notice on the Wiki front page quoted below, we can safely remove the obsolete wiki pointer in `README.md` and `docs/index.md`, too. These two lines are the last occurrences of that link.
      
      ```
      All current wiki content has been merged into pages at http://spark.apache.org as of November 2016.
      Each page links to the new location of its information on the Spark web site.
      Obsolete wiki content is still hosted here, but carries a notice that it is no longer current.
      ```
      
      ## How was this patch tested?
      
      Manual.
      
      - `README.md`: https://github.com/dongjoon-hyun/spark/tree/remove_wiki_from_readme
      - `docs/index.md`:
      ```
      cd docs
      SKIP_API=1 jekyll build
      ```
      ![screen shot 2016-12-09 at 2 53 29 pm](https://cloud.githubusercontent.com/assets/9700541/21067323/517252e2-be1f-11e6-85b1-2a4471131c5d.png)
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #16239 from dongjoon-hyun/remove_wiki_from_readme.
    • [SPARK-17460][SQL] Make sure sizeInBytes in Statistics will not overflow · c5172568
      Huaxin Gao authored
      ## What changes were proposed in this pull request?
      
      1. In `SparkStrategies.canBroadcast`, I will add the check `plan.statistics.sizeInBytes >= 0`.
      2. In `LocalRelations.statistics`, when calculating the statistics, I will change the size to `BigInt` so it won't overflow.
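
To see why both changes matter, the sketch below simulates 64-bit signed arithmetic in Python, whose integers behave like `BigInt` and never overflow on their own. The helper names, the row counts and the 10 MB threshold are illustrative assumptions.

```python
INT64_MASK = (1 << 64) - 1


def as_int64(value):
    """Wrap an arbitrary integer the way a 64-bit signed long would."""
    value &= INT64_MASK
    return value - (1 << 64) if value >= (1 << 63) else value


def can_broadcast(size_in_bytes, threshold):
    """Broadcast only when the size is non-negative and within the threshold."""
    return 0 <= size_in_bytes <= threshold


THRESHOLD = 10 * 1024 * 1024          # illustrative broadcast threshold
rows, row_size = 2 ** 40, 2 ** 23     # product is 2**63, one past the largest long

wrapped = as_int64(rows * row_size)   # what fixed-width arithmetic would produce
exact = rows * row_size               # arbitrary precision, like BigInt
```

Here `wrapped` is negative, so a check of the form `size <= threshold` alone would wrongly allow a broadcast; the `>= 0` guard rejects it, and the arbitrary-precision `exact` value avoids the wraparound in the first place.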
      
      ## How was this patch tested?
      
      I will add a test case to make sure the statistics.sizeInBytes won't overflow.
      
      Author: Huaxin Gao <huaxing@us.ibm.com>
      
      Closes #16175 from huaxingao/spark-17460.
    • [SPARK-18811] StreamSource resolution should happen in stream execution thread · 63c91598
      Burak Yavuz authored
      ## What changes were proposed in this pull request?
      
      When you start a stream, resolving the source of the stream (for example, resolving partition columns) can take a long time. This long execution time should not block the main thread where `query.start()` was called. It should happen in the stream execution thread, possibly before starting any triggers.
      
      ## How was this patch tested?
      
      Unit test added. Made sure test fails with no code changes.
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #16238 from brkyvz/SPARK-18811.
  5. Dec 09, 2016
    • [SPARK-18807][SPARKR] Should suppress output print for calls to JVM methods with void return values · 3e11d5bf
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      Several SparkR APIs that call into JVM methods with void return values print their result, especially when running in a REPL or IDE.
      example:
      ```
      > setLogLevel("WARN")
      NULL
      ```
      We should fix this to make the result more clear.
      
      Also found a small change to return value of dropTempView in 2.1 - adding doc and test for it.
      
      ## How was this patch tested?
      
      manually - I didn't find an expect_*() method in testthat for this
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #16237 from felixcheung/rinvis.
    • [SPARK-18812][MLLIB] explain "Spark ML" · d2493a20
      Xiangrui Meng authored
      ## What changes were proposed in this pull request?
      
      There has been some confusion around "Spark ML" vs. "MLlib". This PR adds some FAQ-like entries to the MLlib user guide to explain "Spark ML" and reduce the confusion.
      
      I checked the [Spark FAQ page](http://spark.apache.org/faq.html), which seems too high-level for the content here. So I added it to the MLlib user guide instead.
      
      cc: mateiz
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #16241 from mengxr/SPARK-18812.
    • [SPARK-4105] retry the fetch or stage if shuffle block is corrupt · cf33a862
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      There is an outstanding issue that has existed for a long time: sometimes shuffle blocks are corrupt and can't be decompressed. We recently hit this in three different workloads; sometimes we can reproduce it on every try, sometimes we can't. I also found that when the corruption happens, the beginning and end of the blocks are correct and the corruption is in the middle. In one case, the block ID string was corrupted by one character. It seems very likely that the corruption is introduced by some weird machine/hardware; also, the checksum (16 bits) in TCP is not strong enough to identify all corruption.
      
      Unfortunately, Spark does not have checksums for shuffle blocks or broadcast, so the job will fail if any corruption happens in a shuffle block read from disk or in a broadcast block sent over the network. This PR tries to detect the corruption after fetching shuffle blocks by decompressing them, because most compression codecs already include a checksum. It will retry the block, or fail with FetchFailure so that the previous stage can be retried on different (still random) machines.
      
      Checksum for broadcast will be added by another PR.
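
The detect-and-retry idea can be sketched with Python's `zlib`, whose stream format ends with an Adler-32 checksum of the uncompressed data. This is not the Spark implementation: `FetchFailure`, `fetch_block` and the `fetch` callable are stand-ins invented for the example, and compression level 0 (stored, uncompressed blocks) is used only so that a flipped byte in the middle is deterministically caught by the checksum.

```python
import zlib


class FetchFailure(Exception):
    """Raised when a block is still corrupt after all attempts."""


def fetch_block(fetch, max_attempts=2):
    """Fetch a compressed block and verify it by decompressing; retry on corruption."""
    for _ in range(max_attempts):
        raw = fetch()
        try:
            # Decompression validates the stream structure and its trailing checksum.
            return zlib.decompress(raw)
        except zlib.error:
            continue  # corrupt copy: fetch the block again
    raise FetchFailure(f"block still corrupt after {max_attempts} attempts")


payload = b"shuffle block payload " * 50
good = zlib.compress(payload, 0)

corrupt = bytearray(good)
corrupt[len(corrupt) // 2] ^= 0xFF    # damage one byte in the middle of the block
corrupt = bytes(corrupt)
```

A fetch that first returns `corrupt` and then `good` succeeds on the second attempt, while a source that only ever returns `corrupt` ends in `FetchFailure`, which is the point where a caller could retry the previous stage.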
      
      ## How was this patch tested?
      
      Added unit tests
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #15923 from davies/detect_corrupt.
    • [SPARK-18745][SQL] Fix signed integer overflow due to toInt cast · d60ab5fd
      Kazuaki Ishizaki authored
      ## What changes were proposed in this pull request?
      
      This PR prevents the result of a `toInt` cast from being negative due to signed integer overflow (e.g. 0x0000_0000_1???????L.toInt < 0 ). It performs the cast only after we can ensure the value is within the range of a signed integer (the result of `max(array.length, ???)` is always an integer).
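
To make the failure mode concrete, here is a small Python sketch that emulates a JVM `Long`-to-`Int` cast (keep the low 32 bits, interpret them as signed). The helper names `to_int` and `bounded_to_int` are hypothetical, and clamping is only one way to guarantee the range before casting; it is not necessarily what the patch does.

```python
INT_MAX = 2**31 - 1


def to_int(value):
    """Emulate a JVM Long-to-Int cast: keep the low 32 bits, read them as signed."""
    low = value & 0xFFFFFFFF
    return low - 2**32 if low > INT_MAX else low


def bounded_to_int(value, upper_bound):
    """Clamp to a bound that is known to fit in an Int, then cast."""
    if not 0 <= upper_bound <= INT_MAX:
        raise ValueError("upper_bound must fit in a signed 32-bit integer")
    return to_int(min(value, upper_bound))
```

Casting first and bounding afterwards can produce a negative number for any value whose bit 31 is set; bounding first keeps the cast lossless.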
      
      ## How was this patch tested?
      
      Manually executed query68 of TPC-DS with 100TB
      
      Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
      
      Closes #16235 from kiszk/SPARK-18745.
      d60ab5fd
    • Takeshi YAMAMURO's avatar
      [SPARK-18620][STREAMING][KINESIS] Flatten input rates in timeline for streaming + kinesis · b08b5004
      Takeshi YAMAMURO authored
      ## What changes were proposed in this pull request?
      This PR flattens the input rates in the timeline for Spark Streaming + Kinesis.
      Since Kinesis workers fetch records and push them into block generators in bulk, the timeline in the web UI has many spikes when `maxRates` is applied (see Figure.1 below). This fix splits the fetched input records into multiple `addRecords` calls.
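
The splitting step can be sketched as follows. This is an illustrative Python version under stated assumptions: `split_into_batches` and `push_in_batches` are hypothetical helpers, `add_records` is a stand-in callback for the receiver's `addRecords`, and the batch size is chosen by the caller rather than taken from the patch.

```python
def split_into_batches(records, batch_size):
    """Split a list of records into consecutive chunks of at most batch_size."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [records[i:i + batch_size] for i in range(0, len(records), batch_size)]


def push_in_batches(records, batch_size, add_records):
    """Call add_records once per chunk instead of once for the whole fetch."""
    for batch in split_into_batches(records, batch_size):
        add_records(batch)
```

Handing over smaller chunks gives a rate limiter the chance to pace each call, which is what smooths the spikes in the timeline.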
      
      Figure.1 Apply `maxRates=500` in vanilla Spark
      <img width="1084" alt="apply_limit in_vanilla_spark" src="https://cloud.githubusercontent.com/assets/692303/20823861/4602f300-b89b-11e6-95f3-164a37061305.png">
      
      Figure.2 Apply `maxRates=500` in Spark with my patch
      <img width="1056" alt="apply_limit in_spark_with_my_patch" src="https://cloud.githubusercontent.com/assets/692303/20823882/6c46352c-b89b-11e6-81ab-afd8abfe0cfe.png">
      
      ## How was this patch tested?
      Added tests to check that input records are split into multiple `addRecords` calls.
      
      Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>
      
      Closes #16114 from maropu/SPARK-18620.
      b08b5004
    • Shivaram Venkataraman's avatar
      [MINOR][SPARKR] Fix SparkR regex in copy command · be5fc6ef
      Shivaram Venkataraman authored
      Fix SparkR package copy regex. The existing code leads to
      ```
      Copying release tarballs to /home/****/public_html/spark-nightly/spark-branch-2.1-bin/spark-2.1.1-SNAPSHOT-2016_12_08_22_38-e8f351f9-bin
      mput: SparkR-*: no files found
      ```
      
      Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
      
      Closes #16231 from shivaram/typo-sparkr-build.
      be5fc6ef
    • Xiangrui Meng's avatar
      [SPARK-17822][R] Make JVMObjectTracker a member variable of RBackend · fd48d80a
      Xiangrui Meng authored
      ## What changes were proposed in this pull request?
      
      * This PR changes `JVMObjectTracker` from `object` to `class` and associates an instance with each `RBackend`, so we can manage the lifecycle of JVM objects when there are multiple `RBackend` sessions. `RBackend.close` will clear the object tracker explicitly.
      * I assume that `SQLUtils` and `RRunner` do not need to track JVM instances, which could be wrong.
      * Small refactor of `SerDe.sqlSerDe` to increase readability.
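
The move from a global singleton to a per-backend instance can be illustrated with a toy model. The Python classes below, `ObjectTracker` and `Backend`, are hypothetical analogues of `JVMObjectTracker` and `RBackend`, not their real APIs; the point is only that each backend owns its tracker and clears it on close.

```python
import itertools


class ObjectTracker:
    """Maps generated string ids to tracked objects."""

    def __init__(self):
        self._objects = {}
        self._ids = itertools.count()

    def add(self, obj):
        obj_id = str(next(self._ids))
        self._objects[obj_id] = obj
        return obj_id

    def get(self, obj_id):
        return self._objects.get(obj_id)

    def clear(self):
        self._objects.clear()

    def __len__(self):
        return len(self._objects)


class Backend:
    """Owns its own tracker, so closing one backend leaves the others alone."""

    def __init__(self):
        self.tracker = ObjectTracker()

    def close(self):
        self.tracker.clear()
```

With a shared singleton, closing one session would either leak its objects or wipe out those of every other session; a per-instance tracker avoids both.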
      
      ## How was this patch tested?
      
      * Added unit tests for `JVMObjectTracker`.
      * Wait for Jenkins to run full tests.
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #16154 from mengxr/SPARK-17822.
      fd48d80a
    • Jacek Laskowski's avatar
      [MINOR][CORE][SQL][DOCS] Typo fixes · b162cc0c
      Jacek Laskowski authored
      ## What changes were proposed in this pull request?
      
      Typo fixes
      
      ## How was this patch tested?
      
      Local build. Awaiting the official build.
      
      Author: Jacek Laskowski <jacek@japila.pl>
      
      Closes #16144 from jaceklaskowski/typo-fixes.
      b162cc0c
    • Zhan Zhang's avatar
      [SPARK-18637][SQL] Stateful UDF should be considered as nondeterministic · 67587d96
      Zhan Zhang authored
      ## What changes were proposed in this pull request?
      
      Mark stateful UDFs as nondeterministic.
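
Why this matters can be shown with a toy model. The Python sketch below is not Spark's optimizer: `CounterUDF` and `evaluate_per_row` are hypothetical, and the "deterministic" branch simply mimics an engine that evaluates a zero-argument expression once and reuses the result for every row.

```python
class CounterUDF:
    """Stateful function: each call returns the next integer, starting at 1."""

    def __init__(self):
        self.count = 0

    def __call__(self):
        self.count += 1
        return self.count


def evaluate_per_row(udf, num_rows, deterministic):
    """Produce one value per row.

    If the function is assumed deterministic, evaluate it once and reuse
    the result (a constant-folding style shortcut); otherwise call it for
    every row.
    """
    if deterministic:
        constant = udf()
        return [constant] * num_rows
    return [udf() for _ in range(num_rows)]
```

Treating the counter as deterministic yields the same value for every row, while calling it per row yields a distinct value each time, so any result that depends on the per-row values changes.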
      
      ## How was this patch tested?
      Added new test cases with both stateful and stateless UDFs.
      Without the patch, the test cases fail with the following exception:
      
      1 did not equal 10
      ScalaTestFailureLocation: org.apache.spark.sql.hive.execution.HiveUDFSuite$$anonfun$21 at (HiveUDFSuite.scala:501)
      org.scalatest.exceptions.TestFailedException: 1 did not equal 10
              at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500)
              at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
              ...
      
      Author: Zhan Zhang <zhanzhang@fb.com>
      
      Closes #16068 from zhzhan/state.
      67587d96
    • Felix Cheung's avatar
      Copy pyspark and SparkR packages to latest release dir too · c074c96d
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      Copy pyspark and SparkR packages to latest release dir, as per comment [here](https://github.com/apache/spark/pull/16226#discussion_r91664822)
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #16227 from felixcheung/pyrftp.
      c074c96d
    • Shivaram Venkataraman's avatar
      Copy the SparkR source package with LFTP · 934035ae
      Shivaram Venkataraman authored
      This PR adds a line in release-build.sh to copy the SparkR source archive using LFTP
      
      Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
      
      Closes #16226 from shivaram/fix-sparkr-copy-build.
      934035ae
    • Weiqing Yang's avatar
      [SPARK-18697][BUILD] Upgrade sbt plugins · 9338aa4f
      Weiqing Yang authored
      ## What changes were proposed in this pull request?
      
      This PR upgrades sbt plugins. The following sbt plugins will be upgraded:
      ```
      sbteclipse-plugin: 4.0.0 -> 5.0.1
      sbt-mima-plugin: 0.1.11 -> 0.1.12
      org.ow2.asm/asm: 5.0.3 -> 5.1
      org.ow2.asm/asm-commons: 5.0.3 -> 5.1
      ```
      ## How was this patch tested?
      Pass the Jenkins build.
      
      Author: Weiqing Yang <yangweiqing001@gmail.com>
      
      Closes #16223 from weiqingy/SPARK_18697.
      9338aa4f
    • wm624@hotmail.com's avatar
      [SPARK-18349][SPARKR] Update R API documentation on ml model summary · 86a96034
      wm624@hotmail.com authored
      ## What changes were proposed in this pull request?
      In this PR, the documentation of the `summary` method is improved to use the following format:
      
      returns summary information of the fitted model, which is a list. The list includes .......
      
      Since `summary` in R is mainly about the model, which is not the same as the `summary` object on the Scala side (if there is one), the Scala API doc is not referenced here.
      
      In the current documentation, some `return` descriptions end with `.` and some don't. A `.` is added to the ones missing it.
      
      Since the `summary` of spark.logit is undergoing a big refactoring, this PR doesn't include it. It will be changed when the `spark.logit` PR is merged.
      
      ## How was this patch tested?
      
      Manual build.
      
      Author: wm624@hotmail.com <wm624@hotmail.com>
      
      Closes #16150 from wangmiao1981/audit2.
      86a96034