  1. Oct 21, 2016
    • Jagadeesan's avatar
      [SPARK-17960][PYSPARK][UPGRADE TO PY4J 0.10.4] · 595893d3
      Jagadeesan authored
      ## What changes were proposed in this pull request?
      
      1) Upgrade the Py4J version on the Java side
      2) Update the py4j src zip file we bundle with Spark
      
      ## How was this patch tested?
      
      Existing doctests & unit tests pass
      
      Author: Jagadeesan <as2@us.ibm.com>
      
      Closes #15514 from jagadeesanas2/SPARK-17960.
      595893d3
  2. Oct 20, 2016
    • WeichenXu's avatar
      [SPARK-18003][SPARK CORE] Fix bug of RDD zipWithIndex & zipWithUniqueId index value overflowing · 39755169
      WeichenXu authored
      ## What changes were proposed in this pull request?
      
      - Fix a bug where RDD `zipWithIndex` generates wrong results when one partition contains more than 2147483647 records.
      
      - Fix a bug where RDD `zipWithUniqueId` generates wrong results when one partition contains more than 2147483647 records (see the sketch below).
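      
      A minimal sketch of the overflow and the fix direction, assuming the per-partition counter is the culprit (illustrative, not the PR's actual code):
      
      ```scala
      // Zip an iterator with a Long index starting at `startIndex`. Using an Int counter
      // here would wrap around once a single partition exceeds Int.MaxValue (2147483647)
      // records, producing negative or duplicated indices.
      def zipWithLongIndex[T](iter: Iterator[T], startIndex: Long): Iterator[(T, Long)] = {
        var i = startIndex - 1
        iter.map { x => i += 1; (x, i) }
      }
      ```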
      
      ## How was this patch tested?
      
      test added.
      
      Author: WeichenXu <WeichenXu123@outlook.com>
      
      Closes #15550 from WeichenXu123/fix_rdd_zipWithIndex_overflow.
      39755169
  3. Oct 19, 2016
    • Alex Bozarth's avatar
      [SPARK-10541][WEB UI] Allow ApplicationHistoryProviders to provide their own... · 444c2d22
      Alex Bozarth authored
      [SPARK-10541][WEB UI] Allow ApplicationHistoryProviders to provide their own text when there aren't any complete apps
      
      ## What changes were proposed in this pull request?
      
      I've added a method to `ApplicationHistoryProvider` that returns the HTML paragraph to display when there are no applications. This allows providers other than `FsHistoryProvider` to determine what is printed. The previously hard-coded text has been moved into `FsHistoryProvider`, since the old code assumed that was the provider in use.
      
      I chose to make the function return HTML rather than plain text because the current text block had inline HTML in it, and it gives a new implementation of `ApplicationHistoryProvider` more versatility. I did not see any security issues with this, since injecting HTML here requires implementing `ApplicationHistoryProvider` and can't be done outside of code.
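      
      A hedged sketch of the hook described above (the method name and signature are illustrative, not necessarily the PR's exact API):
      
      ```scala
      import scala.xml.Node
      
      // Hypothetical sketch: each history provider supplies the HTML shown when the
      // listing contains no completed applications; the default is empty.
      trait EmptyListingHtmlProvider {
        def getEmptyListingHtml(): Seq[Node] = Seq.empty
      }
      
      // An FsHistoryProvider-like implementation would override it with the previously
      // hard-coded paragraph, e.g. a <p> explaining how to configure the log directory.
      ```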
      
      ## How was this patch tested?
      
      Manual testing and dev/run-tests
      
      No visible changes to the UI
      
      Author: Alex Bozarth <ajbozart@us.ibm.com>
      
      Closes #15490 from ajbozarth/spark10541.
      444c2d22
  4. Oct 18, 2016
    • Yu Peng's avatar
      [SPARK-17711][TEST-HADOOP2.2] Fix hadoop2.2 compilation error · 2629cd74
      Yu Peng authored
      ## What changes were proposed in this pull request?
      
      Fix hadoop2.2 compilation error.
      
      ## How was this patch tested?
      
      Existing tests.
      
      cc tdas zsxwing
      
      Author: Yu Peng <loneknightpy@gmail.com>
      
      Closes #15537 from loneknightpy/fix-17711.
      2629cd74
    • Guoqiang Li's avatar
      [SPARK-17930][CORE] The SerializerInstance instance used when deserializing a... · 4518642a
      Guoqiang Li authored
      [SPARK-17930][CORE] The SerializerInstance instance used when deserializing a TaskResult is not reused
      
      ## What changes were proposed in this pull request?
      The following code is called when the DirectTaskResult instance is deserialized
      
      ```scala
      
        def value(): T = {
          if (valueObjectDeserialized) {
            valueObject
          } else {
            // Each deserialization creates a new instance of SerializerInstance, which is very time-consuming
            val resultSer = SparkEnv.get.serializer.newInstance()
            valueObject = resultSer.deserialize(valueBytes)
            valueObjectDeserialized = true
            valueObject
          }
        }
      
      ```
      
      When a stage has a lot of tasks, reusing the SerializerInstance can improve scheduling performance by roughly a factor of three.
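      
      A hedged sketch of the reuse idea (the cache shown here is illustrative; SerializerInstance is not thread-safe, so one instance per thread is kept):
      
      ```scala
      import org.apache.spark.SparkEnv
      import org.apache.spark.serializer.SerializerInstance
      
      // Create the SerializerInstance once per thread instead of calling newInstance()
      // for every deserialized task result.
      object TaskResultSerializerCache {
        private val cached = new ThreadLocal[SerializerInstance] {
          override def initialValue(): SerializerInstance = SparkEnv.get.serializer.newInstance()
        }
        def get(): SerializerInstance = cached.get()
      }
      ```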
      
      The test data is TPC-DS 2T (Parquet), and the SQL statement is as follows (query 2):
      
      ```sql
      
      select  i_item_id,
              avg(ss_quantity) agg1,
              avg(ss_list_price) agg2,
              avg(ss_coupon_amt) agg3,
              avg(ss_sales_price) agg4
       from store_sales, customer_demographics, date_dim, item, promotion
       where ss_sold_date_sk = d_date_sk and
             ss_item_sk = i_item_sk and
             ss_cdemo_sk = cd_demo_sk and
             ss_promo_sk = p_promo_sk and
             cd_gender = 'M' and
             cd_marital_status = 'M' and
             cd_education_status = '4 yr Degree' and
             (p_channel_email = 'N' or p_channel_event = 'N') and
             d_year = 2001
       group by i_item_id
       order by i_item_id
       limit 100;
      
      ```
      
      `spark-defaults.conf` file:
      
      ```
      spark.master                           yarn-client
      spark.executor.instances               20
      spark.driver.memory                    16g
      spark.executor.memory                  30g
      spark.executor.cores                   5
      spark.default.parallelism              100
      spark.sql.shuffle.partitions           100000
      spark.serializer                       org.apache.spark.serializer.KryoSerializer
      spark.driver.maxResultSize              0
      spark.rpc.netty.dispatcher.numThreads   8
      spark.executor.extraJavaOptions          -XX:+UseG1GC -XX:+UseStringDeduplication -XX:G1HeapRegionSize=16M -XX:MetaspaceSize=256M
      spark.cleaner.referenceTracking.blocking true
      spark.cleaner.referenceTracking.blocking.shuffle true
      
      ```
      
      Performance test results are as follows
      
      [SPARK-17930](https://github.com/witgo/spark/tree/SPARK-17930) | [ed146334](https://github.com/witgo/spark/commit/ed1463341455830b8867b721a1b34f291139baf3)
      ------------ | -------------
      54.5 s|231.7 s
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Guoqiang Li <witgo@qq.com>
      
      Closes #15512 from witgo/SPARK-17930.
      4518642a
    • Yu Peng's avatar
      [SPARK-17711] Compress rolled executor log · 231f39e3
      Yu Peng authored
      ## What changes were proposed in this pull request?
      
      This PR adds support for executor log compression.
      
      ## How was this patch tested?
      
      Unit tests
      
      cc: yhuai tdas mengxr
      
      Author: Yu Peng <loneknightpy@gmail.com>
      
      Closes #15285 from loneknightpy/compress-executor-log.
      231f39e3
    • Liwei Lin's avatar
      [SQL][STREAMING][TEST] Fix flaky tests in StreamingQueryListenerSuite · 7d878cf2
      Liwei Lin authored
      This work has largely been done by lw-lin in his PR #15497. This is a slight refactoring of it.
      
      ## What changes were proposed in this pull request?
      There were two sources of flakiness in StreamingQueryListener test.
      
      - When testing with a manual clock, consecutive attempts to advance the clock can occur without the stream execution thread being unblocked and doing some work between the two attempts. Hence the following can happen with the current ManualClock:
      ```
      +-----------------------------------+--------------------------------+
      |      StreamExecution thread       |         testing thread         |
      +-----------------------------------+--------------------------------+
      |  ManualClock.waitTillTime(100) {  |                                |
      |        _isWaiting = true          |                                |
      |            wait(10)               |                                |
      |        still in wait(10)          |  if (_isWaiting) advance(100)  |
      |        still in wait(10)          |  if (_isWaiting) advance(200)  | <- this should be disallowed !
      |        still in wait(10)          |  if (_isWaiting) advance(300)  | <- this should be disallowed !
      |      wake up from wait(10)        |                                |
      |       current time is 600         |                                |
      |       _isWaiting = false          |                                |
      |  }                                |                                |
      +-----------------------------------+--------------------------------+
      ```
      
      - The second source of flakiness is that data added to the memory stream may get processed in any trigger, not just the first one.
      
      My fix is to make the manual clock wait for the stream execution thread to start waiting for the clock at the right wait-start time. That is, `advance(200)` (see above) will wait for the stream execution thread to complete the wait that started at time 0 and to start a new wait at time 200 (i.e. the timestamp after the previous `advance(100)`).
      
      In addition, since this feature is used solely by StreamExecution, I removed all the non-generic code from ManualClock and put it in StreamManualClock inside StreamTest.
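      
      A much-simplified sketch of the synchronization described above (names and details are illustrative, not the PR's StreamManualClock):
      
      ```scala
      // advance() blocks until the stream thread is actually parked in waitTillTime() and
      // its wait began at the expected timestamp, so consecutive advances cannot pile up.
      class SyncManualClock(private var now: Long = 0L) {
        private var waitStart: Long = -1L                 // -1 means no thread is waiting
      
        def waitTillTime(target: Long): Long = synchronized {
          waitStart = now
          notifyAll()                                     // tell advance() we are parked
          while (now < target) wait()
          waitStart = -1L
          now
        }
      
        def advance(delta: Long, expectedWaitStart: Long): Unit = synchronized {
          while (waitStart != expectedWaitStart) wait()   // wait for the right wait-start time
          now += delta
          notifyAll()
        }
      }
      ```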
      
      ## How was this patch tested?
      Ran the existing unit test many times in Jenkins.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      Author: Liwei Lin <lwlin7@gmail.com>
      
      Closes #15519 from tdas/metrics-flaky-test-fix.
      7d878cf2
  5. Oct 17, 2016
    • Sital Kedia's avatar
      [SPARK-17839][CORE] Use Nio's directbuffer instead of BufferedInputStream in... · c7ac027d
      Sital Kedia authored
      [SPARK-17839][CORE] Use Nio's directbuffer instead of BufferedInputStream in order to avoid additional copy from os buffer cache to user buffer
      
      ## What changes were proposed in this pull request?
      
      Currently we use BufferedInputStream to read the shuffle file, which copies the file content from the OS buffer cache to the user buffer. This adds additional latency when reading the spill files. We changed the code to use Java NIO's direct buffer to read the spill files, and for certain pipelines spilling a significant amount of data we see up to a 7% speedup for the entire pipeline.
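      
      A rough sketch of the direct-buffer read path, assuming a fixed 1 MB buffer (illustrative, not the PR's implementation):
      
      ```scala
      import java.io.File
      import java.nio.ByteBuffer
      import java.nio.channels.FileChannel
      import java.nio.file.StandardOpenOption
      
      // Read a spill file through a direct NIO buffer instead of a BufferedInputStream,
      // avoiding the extra copy from the OS buffer cache into a heap buffer.
      def readWithDirectBuffer(file: File)(consume: ByteBuffer => Unit): Unit = {
        val channel = FileChannel.open(file.toPath, StandardOpenOption.READ)
        val buffer = ByteBuffer.allocateDirect(1024 * 1024) // 1 MB direct buffer
        try {
          while (channel.read(buffer) != -1) {
            buffer.flip()        // switch to read mode
            consume(buffer)      // hand the bytes to the caller
            buffer.clear()       // reuse the buffer for the next read
          }
        } finally {
          channel.close()
        }
      }
      ```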
      
      ## How was this patch tested?
      Tested by running the job in the cluster and observed up to 7% speedup.
      
      Author: Sital Kedia <skedia@fb.com>
      
      Closes #15408 from sitalkedia/skedia/nio_spill_read.
      c7ac027d
  6. Oct 16, 2016
  7. Oct 15, 2016
    • Zhan Zhang's avatar
      [SPARK-17637][SCHEDULER] Packed scheduling for Spark tasks across executors · ed146334
      Zhan Zhang authored
      ## What changes were proposed in this pull request?
      
      Restructure the code and implement two new task assigners.
      PackedAssigner: tries to allocate tasks to the executors with the fewest available cores, so that Spark can release reserved executors when dynamic allocation is enabled.
      
      BalancedAssigner: tries to allocate tasks to the executors with the most available cores, in order to balance the workload across all executors.
      
      By default, the original round robin assigner is used.
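      
      A toy sketch of the two policies over a simplified view of the offers (names are illustrative, not the PR's actual API):
      
      ```scala
      case class Offer(executorId: String, freeCores: Int)
      
      // Fill the busiest executors first, so lightly used ones drain and can be released.
      def packedOrder(offers: Seq[Offer]): Seq[Offer] =
        offers.sortBy(_.freeCores)
      
      // Spread tasks toward the executors with the most free cores to balance the load.
      def balancedOrder(offers: Seq[Offer]): Seq[Offer] =
        offers.sortBy(o => -o.freeCores)
      ```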
      
      We tested a pipeline, and the new PackedAssigner saves around 45% of the reserved CPU and memory with dynamic allocation enabled.
      
      ## How was this patch tested?
      
      Both unit test in TaskSchedulerImplSuite and manual tests in production pipeline.
      
      Author: Zhan Zhang <zhanzhang@fb.com>
      
      Closes #15218 from zhzhan/packed-scheduler.
      ed146334
  8. Oct 14, 2016
    • Michael Allman's avatar
      [SPARK-16980][SQL] Load only catalog table partition metadata required to answer a query · 6ce1b675
      Michael Allman authored
      (This PR addresses https://issues.apache.org/jira/browse/SPARK-16980.)
      
      ## What changes were proposed in this pull request?
      
      In a new Spark session, when a partitioned Hive table is converted to use Spark's `HadoopFsRelation` in `HiveMetastoreCatalog`, metadata for every partition of that table are retrieved from the metastore and loaded into driver memory. In addition, every partition's metadata files are read from the filesystem to perform schema inference.
      
      If a user queries such a table with predicates which prune that table's partitions, we would like to be able to answer that query without consulting partition metadata which are not involved in the query. When querying a table with a large number of partitions for some data from a small number of partitions (maybe even a single partition), the current conversion strategy is highly inefficient. I suspect this scenario is not uncommon in the wild.
      
      In addition to being inefficient in running time, the current strategy is inefficient in its use of driver memory. When the sum of the number of partitions of all tables loaded in a driver reaches a certain level (somewhere in the tens of thousands), their cached data exhaust all driver heap memory in the default configuration. I suspect this scenario is less common (in that not too many deployments work with tables with tens of thousands of partitions), however this does illustrate how large the memory footprint of this metadata can be. With tables with hundreds or thousands of partitions, I would expect the `HiveMetastoreCatalog` table cache to represent a significant portion of the driver's heap space.
      
      This PR proposes an alternative approach. Basically, it makes four changes:
      
      1. It adds a new method, `listPartitionsByFilter` to the Catalyst `ExternalCatalog` trait which returns the partition metadata for a given sequence of partition pruning predicates.
      1. It refactors the `FileCatalog` type hierarchy to include a new `TableFileCatalog` to efficiently return files only for partitions matching a sequence of partition pruning predicates.
      1. It removes partition loading and caching from `HiveMetastoreCatalog`.
      1. It adds a new Catalyst optimizer rule, `PruneFileSourcePartitions`, which applies a plan's partition-pruning predicates to prune out unnecessary partition files from a `HadoopFsRelation`'s underlying file catalog.
      
      The net effect is that when a query over a partitioned Hive table is planned, the analyzer retrieves the table metadata from `HiveMetastoreCatalog`. As part of this operation, the `HiveMetastoreCatalog` builds a `HadoopFsRelation` with a `TableFileCatalog`. It does not load any partition metadata or scan any files. The optimizer prunes away unnecessary table partitions by sending the partition-pruning predicates to the relation's `TableFileCatalog`. The `TableFileCatalog` in turn calls the `listPartitionsByFilter` method on its external catalog. This queries the Hive metastore, passing along those filters.
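      
      A rough sketch of the new catalog hook, with an illustrative signature (the real method lives on Catalyst's `ExternalCatalog`):
      
      ```scala
      import org.apache.spark.sql.catalyst.catalog.CatalogTablePartition
      import org.apache.spark.sql.catalyst.expressions.Expression
      
      // Return only the partitions of `db`.`table` whose values can satisfy the given
      // partition-pruning predicates, instead of loading metadata for every partition.
      trait PartitionPruningCatalog {
        def listPartitionsByFilter(
            db: String,
            table: String,
            predicates: Seq[Expression]): Seq[CatalogTablePartition]
      }
      ```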
      
      As a bonus, performing partition pruning during optimization leads to a more accurate relation size estimate. This, along with c481bdf5, can lead to automatic, safe application of the broadcast optimization in a join where it might previously have been omitted.
      
      ## Open Issues
      
      1. This PR omits partition metadata caching. I can add this once the overall strategy for the cold path is established, perhaps in a future PR.
      1. This PR removes and omits partitioned Hive table schema reconciliation. As a result, it fails to find Parquet schema columns with upper case letters because of the Hive metastore's case-insensitivity. This issue may be fixed by #14750, but that PR appears to have stalled. ericl has contributed to this PR a workaround for Parquet wherein schema reconciliation occurs at query execution time instead of planning. Whether ORC requires a similar patch is an open issue.
      1. This PR omits an implementation of `listPartitionsByFilter` for the `InMemoryCatalog`.
      1. This PR breaks parquet log output redirection during query execution. I can work around this by running `Class.forName("org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$")` first thing in a Spark shell session, but I haven't figured out how to fix this properly.
      
      ## How was this patch tested?
      
      The current Spark unit tests were run, and some ad-hoc tests were performed to validate that only the necessary partition metadata is loaded.
      
      Author: Michael Allman <michael@videoamp.com>
      Author: Eric Liang <ekl@databricks.com>
      Author: Eric Liang <ekhliang@gmail.com>
      
      Closes #14690 from mallman/spark-16980-lazy_partition_fetching.
      6ce1b675
    • invkrh's avatar
      [SPARK-17855][CORE] Remove query string from jar url · 28b645b1
      invkrh authored
      ## What changes were proposed in this pull request?
      
      Spark-submit supports jar URLs with the http protocol. However, if the URL contains a query string, the `worker.DriverRunner.downloadUserJar()` method throws a "Did not see expected jar" exception. This is because the method checks for the existence of a downloaded jar whose name contains the query string. This is a problem when your jar is located on a web service that requires additional information to retrieve the file.
      
      This PR simply removes the query string before checking for the jar's existence on the worker.
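      
      A minimal sketch of the idea, assuming the local jar name is derived from the URL path only (illustrative helper, not the PR's code):
      
      ```scala
      import java.net.URI
      
      // Derive the local jar file name from the URL path, ignoring any query string.
      def jarFileName(jarUrl: String): String = {
        val path = new URI(jarUrl).getPath          // "/spark-job.jar" for "http://host/spark-job.jar?param=1"
        path.substring(path.lastIndexOf('/') + 1)   // "spark-job.jar"
      }
      ```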
      
      ## How was this patch tested?
      
      For now, this patch can only be tested manually:
      * Deploy a spark cluster locally
      * Make sure apache httpd service is on
      * Save an uber jar, e.g. spark-job.jar, under `/var/www/html/`
      * Use http://localhost/spark-job.jar?param=1 as the jar URL when running `spark-submit`
      * Job should be launched
      
      Author: invkrh <invkrh@gmail.com>
      
      Closes #15420 from invkrh/spark-17855.
      28b645b1
  9. Oct 13, 2016
    • jerryshao's avatar
      [SPARK-17686][CORE] Support printing out scala and java version with spark-submit --version command · 7bf8a404
      jerryshao authored
      ## What changes were proposed in this pull request?
      
      In our universal gateway service we need to specify different jars to Spark according to the Scala version. Currently we can only find out which Scala version a Spark build depends on after launching an application, which makes it hard to pick the right jars when supporting different Scala + Spark versions.
      
      So here we propose to print out the Scala version alongside the Spark version in `spark-submit --version`, so that users can leverage this output to make the choice without needing to launch an application.
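      
      For illustration, the Scala version is available at runtime and could be printed in the banner along these lines (a sketch, not necessarily the exact output format used by the patch):
      
      ```scala
      // Obtain the Scala version and JVM name at runtime for the --version banner.
      println(s"Using Scala ${scala.util.Properties.versionString}, ${scala.util.Properties.javaVmName}")
      ```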
      
      ## How was this patch tested?
      
      Manually verified in local environment.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #15456 from jerryshao/SPARK-17686.
      7bf8a404
    • Alex Bozarth's avatar
      [SPARK-11272][WEB UI] Add support for downloading event logs from HistoryServer UI · 6f2fa6c5
      Alex Bozarth authored
      ## What changes were proposed in this pull request?
      
      This is a reworked PR based on feedback in #9238 after it was closed and not reopened. As suggested in that PR I've only added the download feature. This functionality already exists in the api and this allows easier access to download event logs to share with others.
      
      I've attached a screenshot of the committed version, but I will also include alternate options with screen shots in the comments below. I'm personally not sure which option is best.
      
      ## How was this patch tested?
      
      Manual testing
      
      ![screen shot 2016-10-07 at 6 11 12 pm](https://cloud.githubusercontent.com/assets/13952758/19209213/832fe48e-8cba-11e6-9840-749b1be4d399.png)
      
      Author: Alex Bozarth <ajbozart@us.ibm.com>
      
      Closes #15400 from ajbozarth/spark11272.
      6f2fa6c5
  10. Oct 12, 2016
    • Imran Rashid's avatar
      [SPARK-17675][CORE] Expand Blacklist for TaskSets · 9ce7d3e5
      Imran Rashid authored
      ## What changes were proposed in this pull request?
      
      This is a step along the way to SPARK-8425.
      
      To enable incremental review, the first step proposed here is to expand the blacklisting within tasksets. In particular, this will enable blacklisting for
      * (task, executor) pairs (this already exists via an undocumented config)
      * (task, node)
      * (taskset, executor)
      * (taskset, node)
      
      Adding (task, node) is critical to making Spark fault-tolerant to one bad disk in a cluster, without requiring careful tuning of "spark.task.maxFailures". The other additions are also important to avoid many misleading task failures and long scheduling delays when there is one bad node on a large cluster.
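      
      A rough sketch of the per-taskset bookkeeping described above (names and thresholds are illustrative, not the PR's actual classes):
      
      ```scala
      import scala.collection.mutable
      
      // Track failures per (task, executor) and per (task, node), and consult both
      // before offering a task to an executor on a given host.
      class TaskSetBlacklistSketch(maxFailuresPerExec: Int, maxFailuresPerNode: Int) {
        private val failuresByExec = mutable.Map.empty[(Int, String), Int]  // (taskIndex, executorId) -> count
        private val failuresByNode = mutable.Map.empty[(Int, String), Int]  // (taskIndex, host) -> count
      
        def recordFailure(taskIndex: Int, executorId: String, host: String): Unit = {
          failuresByExec((taskIndex, executorId)) = failuresByExec.getOrElse((taskIndex, executorId), 0) + 1
          failuresByNode((taskIndex, host)) = failuresByNode.getOrElse((taskIndex, host), 0) + 1
        }
      
        def isBlacklisted(taskIndex: Int, executorId: String, host: String): Boolean =
          failuresByExec.getOrElse((taskIndex, executorId), 0) >= maxFailuresPerExec ||
          failuresByNode.getOrElse((taskIndex, host), 0) >= maxFailuresPerNode
      }
      ```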
      
      Note that some of the code changes here aren't really required for just this -- they put pieces in place for SPARK-8425 even though they are not used yet (eg. the `BlacklistTracker` helper is a little out of place, `TaskSetBlacklist` holds onto a little more info than it needs to for just this change, and `ExecutorFailuresInTaskSet` is more complex than it needs to be).
      
      ## How was this patch tested?
      
      Added unit tests, run tests via jenkins.
      
      Author: Imran Rashid <irashid@cloudera.com>
      Author: mwws <wei.mao@intel.com>
      
      Closes #15249 from squito/taskset_blacklist_only.
      9ce7d3e5
    • Shixiong Zhu's avatar
      [SPARK-17850][CORE] Add a flag to ignore corrupt files · 47776e7c
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      Add a flag to ignore corrupt files. For Spark core, the configuration is `spark.files.ignoreCorruptFiles`. For Spark SQL, it's `spark.sql.files.ignoreCorruptFiles`.
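      
      A usage sketch with the configuration keys named above:
      
      ```scala
      import org.apache.spark.sql.SparkSession
      
      // Enable the new flags when building the session.
      val spark = SparkSession.builder()
        .appName("ignore-corrupt-files")
        .config("spark.files.ignoreCorruptFiles", "true")      // Spark core reads
        .config("spark.sql.files.ignoreCorruptFiles", "true")  // Spark SQL file sources
        .getOrCreate()
      ```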
      
      ## How was this patch tested?
      
      The added unit tests
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #15422 from zsxwing/SPARK-17850.
      47776e7c
    • Hossein's avatar
      [SPARK-17790][SPARKR] Support for parallelizing R data.frame larger than 2GB · 5cc503f4
      Hossein authored
      ## What changes were proposed in this pull request?
      If the R data structure being parallelized is larger than `INT_MAX`, we use files to transfer data to the JVM. The serialization protocol mimics Python pickling, which allows us to simply call `PythonRDD.readRDDFromFile` to create the RDD.
      
      I tested this on my MacBook. Following code works with this patch:
      ```R
      intMax <- .Machine$integer.max
      largeVec <- 1:intMax
      rdd <- SparkR:::parallelize(sc, largeVec, 2)
      ```
      
      ## How was this patch tested?
      * [x] Unit tests
      
      Author: Hossein <hossein@databricks.com>
      
      Closes #15375 from falaki/SPARK-17790.
      5cc503f4
  11. Oct 11, 2016
    • Wenchen Fan's avatar
      [SPARK-17720][SQL] introduce static SQL conf · b9a14718
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      SQLConf is session-scoped and mutable. However, we do have the requirement for a static SQL conf, which is global and immutable, e.g. the `schemaStringThreshold` in `HiveExternalCatalog`, the flag to enable/disable hive support, the global temp view database in https://github.com/apache/spark/pull/14897.
      
      Actually we've already implemented static SQL confs implicitly via `SparkConf`; this PR just makes them explicit and exposes them to users, so that they can see the config values via a SQL command or `SparkSession.conf`, while forbidding users from setting/unsetting static SQL confs.
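      
      A usage sketch under the assumption that a static conf is supplied once at session creation (the key below is illustrative):
      
      ```scala
      import org.apache.spark.sql.SparkSession
      
      // A static SQL conf is fixed before the session starts and is read-only afterwards.
      val spark = SparkSession.builder()
        .config("spark.sql.globalTempDatabase", "global_temp")   // set once, via SparkConf
        .getOrCreate()
      
      spark.conf.get("spark.sql.globalTempDatabase")              // visible to users
      // spark.conf.set("spark.sql.globalTempDatabase", "other")  // would fail: static confs are immutable
      ```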
      
      ## How was this patch tested?
      
      new tests in SQLConfSuite
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #15295 from cloud-fan/global-conf.
      b9a14718
    • Bryan Cutler's avatar
      [SPARK-17808][PYSPARK] Upgraded version of Pyrolite to 4.13 · 658c7147
      Bryan Cutler authored
      ## What changes were proposed in this pull request?
      Upgraded to a newer version of Pyrolite which supports serialization of a BinaryType StructField for PySpark.SQL
      
      ## How was this patch tested?
      Added a unit test which fails with a raised ValueError when using the previous version of Pyrolite 4.9 and Python3
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      
      Closes #15386 from BryanCutler/pyrolite-upgrade-SPARK-17808.
      658c7147
  12. Oct 10, 2016
    • Ergin Seyfe's avatar
      [SPARK-17816][CORE] Fix ConcurrentModificationException issue in BlockStatusesAccumulator · 19a5bae4
      Ergin Seyfe authored
      ## What changes were proposed in this pull request?
      Change BlockStatusesAccumulator to return an immutable object when the value method is called.
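      
      A minimal sketch of the fix idea, assuming the accumulator hands out a defensive immutable copy (illustrative, not the actual class):
      
      ```scala
      import scala.collection.mutable.ArrayBuffer
      
      // `value` returns an immutable snapshot instead of the live, mutating buffer, so
      // callers (e.g. JSON serialization of task-end events) cannot hit a
      // ConcurrentModificationException while tasks keep appending statuses.
      class StatusesAccumulator[T] {
        private val buf = new ArrayBuffer[T]
        def add(v: T): Unit = synchronized { buf += v }
        def value: Seq[T] = synchronized { buf.toList }   // defensive, immutable copy
      }
      ```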
      
      ## How was this patch tested?
      Existing tests, plus I verified this change by running a pipeline which consistently reproduces this issue.
      
      This is the stack trace for this exception:
      ```
      java.util.ConcurrentModificationException
              at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:901)
              at java.util.ArrayList$Itr.next(ArrayList.java:851)
              at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43)
              at scala.collection.Iterator$class.foreach(Iterator.scala:893)
              at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
              at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
              at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
              at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
              at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:183)
              at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:45)
              at scala.collection.TraversableLike$class.to(TraversableLike.scala:590)
              at scala.collection.AbstractTraversable.to(Traversable.scala:104)
              at scala.collection.TraversableOnce$class.toList(TraversableOnce.scala:294)
              at scala.collection.AbstractTraversable.toList(Traversable.scala:104)
              at org.apache.spark.util.JsonProtocol$.accumValueToJson(JsonProtocol.scala:314)
              at org.apache.spark.util.JsonProtocol$$anonfun$accumulableInfoToJson$5.apply(JsonProtocol.scala:291)
              at org.apache.spark.util.JsonProtocol$$anonfun$accumulableInfoToJson$5.apply(JsonProtocol.scala:291)
              at scala.Option.map(Option.scala:146)
              at org.apache.spark.util.JsonProtocol$.accumulableInfoToJson(JsonProtocol.scala:291)
              at org.apache.spark.util.JsonProtocol$$anonfun$taskInfoToJson$12.apply(JsonProtocol.scala:283)
              at org.apache.spark.util.JsonProtocol$$anonfun$taskInfoToJson$12.apply(JsonProtocol.scala:283)
              at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
              at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
              at scala.collection.immutable.List.foreach(List.scala:381)
              at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
              at scala.collection.mutable.ListBuffer.foreach(ListBuffer.scala:45)
              at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
              at scala.collection.AbstractTraversable.map(Traversable.scala:104)
              at org.apache.spark.util.JsonProtocol$.taskInfoToJson(JsonProtocol.scala:283)
              at org.apache.spark.util.JsonProtocol$.taskEndToJson(JsonProtocol.scala:145)
              at org.apache.spark.util.JsonProtocol$.sparkEventToJson(JsonProtocol.scala:76)
      ```
      
      Author: Ergin Seyfe <eseyfe@fb.com>
      
      Closes #15371 from seyfe/race_cond_jsonprotocal.
      19a5bae4
    • Dhruve Ashar's avatar
      [SPARK-17417][CORE] Fix # of partitions for Reliable RDD checkpointing · 4bafacaa
      Dhruve Ashar authored
      ## What changes were proposed in this pull request?
      Currently the number of partition files is limited to 10000 (%05d format). If there are more than 10000 part files, the logic breaks while recreating the RDD because it sorts the file names as strings. More details can be found in the JIRA description [here](https://issues.apache.org/jira/browse/SPARK-17417).
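      
      A quick illustration of why string-sorting numerically suffixed part files breaks once the suffix grows an extra digit:
      
      ```scala
      val names = Seq("part-99999", "part-100000")
      names.sorted                                 // List(part-100000, part-99999): wrong, lexicographic order
      names.sortBy(_.stripPrefix("part-").toInt)   // List(part-99999, part-100000): correct numeric order
      ```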
      
      ## How was this patch tested?
      I tested this patch by checkpointing an RDD, manually renaming the part files to the old format, and then trying to access the RDD. It was successfully recreated from the old format. I also verified loading a sample Parquet file, saving it in multiple formats (CSV, JSON, Text, Parquet, ORC), and reading them back successfully from the saved files. I couldn't launch the unit test from my local box, so I will wait for the Jenkins output.
      
      Author: Dhruve Ashar <dhruveashar@gmail.com>
      
      Closes #15370 from dhruve/bug/SPARK-17417.
      4bafacaa
    • Wenchen Fan's avatar
      [SPARK-17338][SQL] add global temp view · 23ddff4b
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      A global temporary view is a cross-session temporary view, which means it's shared among all sessions. Its lifetime is the lifetime of the Spark application, i.e. it is automatically dropped when the application terminates. It's tied to a system-preserved database `global_temp` (configurable via SparkConf), and we must use the qualified name to refer to a global temp view, e.g. `SELECT * FROM global_temp.view1`.
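      
      A usage sketch based on the description above (assuming `spark` is a SparkSession and `df` a Dataset):
      
      ```scala
      // Register a Dataset as a global temp view and query it with the qualified name.
      df.createGlobalTempView("view1")                      // lives in the global_temp database
      spark.sql("SELECT * FROM global_temp.view1").show()   // shared across sessions
      // The view is dropped automatically when the application terminates.
      ```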
      
      changes for `SessionCatalog`:
      
      1. add a new field `globalTempViews: GlobalTempViewManager`, to access the shared global temp views and the global temp db name.
      2. `createDatabase` will fail if users want to create `global_temp`, which is system preserved.
      3. `setCurrentDatabase` will fail if users want to set `global_temp`, which is system preserved.
      4. add `createGlobalTempView`, which is used in `CreateViewCommand` to create global temp views.
      5. add `dropGlobalTempView`, which is used in `CatalogImpl` to drop global temp view.
      6. add `alterTempViewDefinition`, which is used in `AlterViewAsCommand` to update the view definition for local/global temp views.
      7. `renameTable`/`dropTable`/`isTemporaryTable`/`lookupRelation`/`getTempViewOrPermanentTableMetadata`/`refreshTable` will handle global temp views.
      
      changes for SQL commands:
      
      1. `CreateViewCommand`/`AlterViewAsCommand` is updated to support global temp views
      2. `ShowTablesCommand` outputs a new column `database`, which is used to distinguish global and local temp views.
      3. other commands can also handle global temp views if they call `SessionCatalog` APIs which accepts global temp views, e.g. `DropTableCommand`, `AlterTableRenameCommand`, `ShowColumnsCommand`, etc.
      
      changes for other public API
      
      1. add a new method `dropGlobalTempView` in `Catalog`
      2. `Catalog.findTable` can find global temp view
      3. add a new method `createGlobalTempView` in `Dataset`
      
      ## How was this patch tested?
      
      new tests in `SQLViewSuite`
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #14897 from cloud-fan/global-temp-view.
      23ddff4b
  13. Oct 08, 2016
  14. Oct 07, 2016
    • Sean Owen's avatar
      [SPARK-17707][WEBUI] Web UI prevents spark-submit application to be finished · cff56075
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      This expands calls to Jetty's simple `ServerConnector` constructor to explicitly specify a `ScheduledExecutorScheduler` that makes daemon threads. It should otherwise result in exactly the same configuration, because the other args are copied from the constructor that is currently called.
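      
      A hedged sketch of the connector construction described above (the argument values are assumptions mirroring Jetty's defaults, not a verbatim copy of the patch):
      
      ```scala
      import org.eclipse.jetty.server.{HttpConnectionFactory, Server, ServerConnector}
      import org.eclipse.jetty.util.thread.ScheduledExecutorScheduler
      
      // Pass an explicit scheduler that creates daemon threads, so Jetty's scheduler
      // cannot keep the JVM alive after the application's main thread finishes.
      def daemonConnector(server: Server): ServerConnector =
        new ServerConnector(
          server,
          null,                                                   // default executor
          new ScheduledExecutorScheduler("ui-scheduler", true),   // daemon = true
          null,                                                   // default buffer pool
          -1, -1,                                                 // default acceptors/selectors
          new HttpConnectionFactory())
      ```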
      
      (I'm not sure we should change the Hive Thriftserver impl, but I did anyway.)
      
      This also adds `sc.stop()` to the quick start guide example.
      
      ## How was this patch tested?
      
      Existing tests; _pending_ at least manual verification of the fix.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #15381 from srowen/SPARK-17707.
      cff56075
    • Brian Cho's avatar
      [SPARK-16827] Stop reporting spill metrics as shuffle metrics · e56614cb
      Brian Cho authored
      ## What changes were proposed in this pull request?
      
      Fix a bug where spill metrics were being reported as shuffle metrics. Eventually these spill metrics should be reported (SPARK-3577), but separate from shuffle metrics. The fix itself basically reverts the line to what it was in 1.6.
      
      ## How was this patch tested?
      
      Tested on a job that was reporting shuffle writes even for the final stage, when no shuffle writes should take place. After the change the job no longer shows these writes.
      
      Before:
      ![screen shot 2016-10-03 at 6 39 59 pm](https://cloud.githubusercontent.com/assets/1514239/19085897/dbf59a92-8a20-11e6-9f68-a978860c0d74.png)
      
      After:
      <img width="1052" alt="screen shot 2016-10-03 at 11 44 44 pm" src="https://cloud.githubusercontent.com/assets/1514239/19085903/e173a860-8a20-11e6-85e3-d47f9835f494.png">
      
      Author: Brian Cho <bcho@fb.com>
      
      Closes #15347 from dafrista/shuffle-metrics.
      e56614cb
    • Alex Bozarth's avatar
      [SPARK-17795][WEB UI] Sorting on stage or job tables doesn’t reload page on that table · 24097d84
      Alex Bozarth authored
      ## What changes were proposed in this pull request?
      
      Added an anchor to the table header id in the sorting links on the job and stage tables. This makes the page, when it reloads after a sort, land at the sorted table.
      
      This only changes page load behavior so no UI changes
      
      ## How was this patch tested?
      
      manually tested and dev/run-tests
      
      Author: Alex Bozarth <ajbozart@us.ibm.com>
      
      Closes #15369 from ajbozarth/spark17795.
      24097d84
  15. Oct 05, 2016
    • Shixiong Zhu's avatar
      [SPARK-17346][SQL] Add Kafka source for Structured Streaming · 9293734d
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      This PR adds a new project, `external/kafka-0-10-sql`, for the Structured Streaming Kafka source.
      
      It's based on the design doc: https://docs.google.com/document/d/19t2rWe51x7tq2e5AOfrsM9qb8_m7BRuv9fel9i0PqR8/edit?usp=sharing
      
      tdas did most of the work, and part of it was inspired by koeninger's work.
      
      ### Introduction
      
      The Kafka source is a Structured Streaming data source that polls data from Kafka. The schema of the data read is as follows:
      
      Column | Type
      ---- | ----
      key | binary
      value | binary
      topic | string
      partition | int
      offset | long
      timestamp | long
      timestampType | int
      
      The source can deal with topics being deleted. However, the user should make sure no Spark job is processing the data when a topic is deleted.
      
      ### Configuration
      
      The user can use `DataStreamReader.option` to set the following configurations.
      
      Kafka Source's options | value | default | meaning
      ------ | ------- | ------ | -----
      startingOffset | ["earliest", "latest"] | "latest" | The start point when a query is started, either "earliest" which is from the earliest offset, or "latest" which is just from the latest offset. Note: This only applies when a new Streaming query is started, and that resuming will always pick up from where the query left off.
      failOnDataLoss | [true, false] | true | Whether to fail the query when it's possible that data is lost (e.g., topics are deleted, or offsets are out of range). This may be a false alarm. You can disable it when it doesn't work as expected.
      subscribe | A comma-separated list of topics | (none) | The topic list to subscribe to. Only one of the "subscribe" and "subscribePattern" options can be specified for the Kafka source.
      subscribePattern | Java regex string | (none) | The pattern used to subscribe to topics. Only one of the "subscribe" and "subscribePattern" options can be specified for the Kafka source.
      kafka.consumer.poll.timeoutMs | long | 512 | The timeout in milliseconds to poll data from Kafka in executors
      fetchOffset.numRetries | int | 3 | Number of times to retry before giving up fetching the latest Kafka offsets.
      fetchOffset.retryIntervalMs | long | 10 | milliseconds to wait before retrying to fetch Kafka offsets
      
      Kafka's own configurations can be set via `DataStreamReader.option` with `kafka.` prefix, e.g, `stream.option("kafka.bootstrap.servers", "host:port")`
      
      ### Usage
      
      * Subscribe to 1 topic
      ```Scala
      spark
        .readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "host:port")
        .option("subscribe", "topic1")
        .load()
      ```
      
      * Subscribe to multiple topics
      ```Scala
      spark
        .readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "host:port")
        .option("subscribe", "topic1,topic2")
        .load()
      ```
      
      * Subscribe to a pattern
      ```Scala
      spark
        .readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "host:port")
        .option("subscribePattern", "topic.*")
        .load()
      ```
      
      ## How was this patch tested?
      
      The new unit tests.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      Author: Shixiong Zhu <zsxwing@gmail.com>
      Author: cody koeninger <cody@koeninger.org>
      
      Closes #15102 from zsxwing/kafka-source.
      9293734d
    • Shixiong Zhu's avatar
      [SPARK-17778][TESTS] Mock SparkContext to reduce memory usage of BlockManagerSuite · 221b418b
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      Mock SparkContext to reduce memory usage of BlockManagerSuite
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #15350 from zsxwing/SPARK-17778.
      221b418b
  16. Oct 04, 2016
    • sumansomasundar's avatar
      [SPARK-16962][CORE][SQL] Fix misaligned record accesses for SPARC architectures · 7d516088
      sumansomasundar authored
      ## What changes were proposed in this pull request?
      
      Made changes to record length offsets to make them uniform throughout various areas of Spark core and unsafe
      
      ## How was this patch tested?
      
      This change affects only SPARC architectures and was tested on X86 architectures as well for regression.
      
      Author: sumansomasundar <suman.somasundar@oracle.com>
      
      Closes #14762 from sumansomasundar/master.
      7d516088
    • Sean Owen's avatar
      [SPARK-17671][WEBUI] Spark 2.0 history server summary page is slow even set... · 8e8de007
      Sean Owen authored
      [SPARK-17671][WEBUI] Spark 2.0 history server summary page is slow even set spark.history.ui.maxApplications
      
      ## What changes were proposed in this pull request?
      
      Return Iterator of applications internally in history server, for consistency and performance. See https://github.com/apache/spark/pull/15248 for some back-story.
      
      The code called by and calling HistoryServer.getApplicationList wants an Iterator, but this method materializes an Iterable, which potentially causes a performance problem. It's simpler too to make this internal method also pass through an Iterator.
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #15321 from srowen/SPARK-17671.
      8e8de007
  17. Oct 02, 2016
    • Tao LI's avatar
      [SPARK-14914][CORE][SQL] Skip/fix some test cases on Windows due to limitation of Windows · 76dc2d90
      Tao LI authored
      ## What changes were proposed in this pull request?
      
      This PR proposes to fix/skip some tests failed on Windows. This PR takes over https://github.com/apache/spark/pull/12696.
      
      **Before**
      
      - **SparkSubmitSuite**
      
        ```
      [info] - launch simple application with spark-submit *** FAILED *** (202 milliseconds)
      [info]   java.io.IOException: Cannot run program "./bin/spark-submit" (in directory "C:\projects\spark"): CreateProcess error=2, The system cannot find the file specifie
      
      [info] - includes jars passed in through --jars *** FAILED *** (1 second, 625 milliseconds)
      [info]   java.io.IOException: Cannot run program "./bin/spark-submit" (in directory "C:\projects\spark"): CreateProcess error=2, The system cannot find the file specified
      ```
      
      - **DiskStoreSuite**
      
        ```
      [info] - reads of memory-mapped and non memory-mapped files are equivalent *** FAILED *** (1 second, 78 milliseconds)
      [info]   diskStoreMapped.remove(blockId) was false (DiskStoreSuite.scala:41)
      ```
      
      **After**
      
      - **SparkSubmitSuite**
      
        ```
      [info] - launch simple application with spark-submit (578 milliseconds)
      [info] - includes jars passed in through --jars (1 second, 875 milliseconds)
      ```
      
      - **DiskStoreSuite**
      
        ```
      [info] DiskStoreSuite:
      [info] - reads of memory-mapped and non memory-mapped files are equivalent !!! CANCELED !!! (766 milliseconds
      ```
      
      For `CreateTableAsSelectSuite` and `FsHistoryProviderSuite`, I could not reproduce as the Java version seems higher than the one that has the bugs about `setReadable(..)` and `setWritable(...)` but as they are bugs reported clearly, it'd be sensible to skip those. We should revert the changes for both back as soon as we drop the support of Java 7.
      
      ## How was this patch tested?
      
      Manually tested via AppVeyor.
      
      Closes #12696
      
      Author: Tao LI <tl@microsoft.com>
      Author: U-FAREAST\tl <tl@microsoft.com>
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #15320 from HyukjinKwon/SPARK-14914.
      76dc2d90
  18. Oct 01, 2016
    • Eric Liang's avatar
      [SPARK-17740] Spark tests should mock / interpose HDFS to ensure that streams are closed · 4bcd9b72
      Eric Liang authored
      ## What changes were proposed in this pull request?
      
      As a follow-up to SPARK-17666, ensure filesystem connections are not leaked, at least in unit tests. This is done by intercepting filesystem calls, as suggested by JoshRosen. At the end of each test, we assert that no filesystem streams are left open.
      
      This applies to all tests using SharedSQLContext or SharedSparkContext.
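      
      A much-simplified sketch of the interposition idea (the PR's actual helper is more complete; names here are illustrative):
      
      ```scala
      import java.util.concurrent.ConcurrentHashMap
      
      // An interposing filesystem would report every opened stream here and remove it on
      // close; each test then asserts that nothing opened remains unclosed.
      object OpenStreamTracker {
        private val open = ConcurrentHashMap.newKeySet[String]()
        def opened(path: String): Unit = open.add(path)
        def closed(path: String): Unit = open.remove(path)
        def assertNoLeaks(): Unit =
          assert(open.isEmpty, s"Leaked filesystem streams: $open")
      }
      ```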
      
      ## How was this patch tested?
      
      I verified that tests in sql and core are indeed using the filesystem backend, and fixed the detected leaks. I also checked that reverting https://github.com/apache/spark/pull/15245 causes many actual test failures due to connection leaks.
      
      Author: Eric Liang <ekl@databricks.com>
      Author: Eric Liang <ekhliang@gmail.com>
      
      Closes #15306 from ericl/sc-4672.
      4bcd9b72
  19. Sep 30, 2016
    • Shubham Chopra's avatar
      [SPARK-15353][CORE] Making peer selection for block replication pluggable · a26afd52
      Shubham Chopra authored
      ## What changes were proposed in this pull request?
      
      This PR makes block replication strategies pluggable. It provides two traits that can be implemented: one that maps a host to its topology and is used on the master, and a second that helps prioritize a list of peers for block replication and runs on the executors.
      
      This patch contains default implementations of these traits that make sure current Spark behavior is unchanged.
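      
      A rough sketch of the two hooks described above (trait, type, and method names are illustrative, not necessarily the PR's exact API):
      
      ```scala
      case class Peer(executorId: String, host: String, topologyInfo: Option[String])
      
      // Runs on the master: map a host name to its topology (e.g. rack) information.
      trait TopologyMapperSketch {
        def getTopologyForHost(hostname: String): Option[String]
      }
      
      // Runs on the executors: order candidate peers for replicating a given block.
      trait ReplicationPrioritizerSketch {
        def prioritize(peers: Seq[Peer], blockId: String, numReplicas: Int): Seq[Peer]
      }
      ```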
      
      ## How was this patch tested?
      
      This patch should not change Spark behavior in any way, and was tested with unit tests for storage.
      
      Author: Shubham Chopra <schopra31@bloomberg.net>
      
      Closes #13152 from shubhamchopra/RackAwareBlockReplication.
      a26afd52
  20. Sep 29, 2016
    • Imran Rashid's avatar
      [SPARK-17676][CORE] FsHistoryProvider should ignore hidden files · 3993ebca
      Imran Rashid authored
      ## What changes were proposed in this pull request?
      
      FsHistoryProvider was writing a hidden file (to check the fs's clock).
      Even though it deleted the file immediately, sometimes another thread
      would try to scan the files on the fs in-between, and then there would
      be an error message logged which was very misleading for the end user.
      (The logged error was harmless, though.)
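      
      The filter itself is tiny; a sketch of the idea (illustrative helper, not the provider's actual code):
      
      ```scala
      // Skip hidden files (names starting with ".") when scanning the event-log directory,
      // so transient files like the clock-check file are never listed as applications.
      def visibleLogFiles(fileNames: Seq[String]): Seq[String] =
        fileNames.filterNot(_.startsWith("."))
      ```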
      
      ## How was this patch tested?
      
      I added one unit test, but to be clear, that test was passing before.  The actual change in behavior in that test is just logging (after the change, there is no more logged error), which I just manually verified.
      
      Author: Imran Rashid <irashid@cloudera.com>
      
      Closes #15250 from squito/SPARK-17676.
      3993ebca
    • Brian Cho's avatar
      [SPARK-17715][SCHEDULER] Make task launch logs DEBUG · 027dea8f
      Brian Cho authored
      ## What changes were proposed in this pull request?
      
      Ramp down the task launch logs from INFO to DEBUG. Task launches can happen orders of magnitude more often than executor registrations, so the logs are easier to handle if the two use different log levels. For larger jobs, there can be hundreds of thousands of task launches, which makes the driver log huge.
      
      ## How was this patch tested?
      
      No tests, as this is a trivial change.
      
      Author: Brian Cho <bcho@fb.com>
      
      Closes #15290 from dafrista/ramp-down-task-logging.
      027dea8f
    • Gang Wu's avatar
      [SPARK-17672] Spark 2.0 history server web Ui takes too long for a single application · cb87b3ce
      Gang Wu authored
      Added a new API, getApplicationInfo(appId: String), to ApplicationHistoryProvider and SparkUI to get app info. With this change, FsHistoryProvider can fetch a single application's info in O(1) time, compared to O(n) before the change, which used an Iterator.find() interface.
      
      Both ApplicationCache and OneApplicationResource classes adopt this new api.
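      
      A toy sketch contrasting the old scan with the new keyed lookup (names are illustrative):
      
      ```scala
      import scala.collection.mutable
      
      case class ApplicationInfo(id: String, name: String)
      
      class HistoryIndex {
        private val applications = new mutable.LinkedHashMap[String, ApplicationInfo]
      
        // Old behavior: scan the whole listing, O(n).
        def findApplication(appId: String): Option[ApplicationInfo] =
          applications.values.iterator.find(_.id == appId)
      
        // New behavior: direct keyed lookup, O(1).
        def getApplicationInfo(appId: String): Option[ApplicationInfo] =
          applications.get(appId)
      }
      ```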
      
       manual tests
      
      Author: Gang Wu <wgtmac@uber.com>
      
      Closes #15247 from wgtmac/SPARK-17671.
      cb87b3ce
    • Imran Rashid's avatar
      [SPARK-17648][CORE] TaskScheduler really needs offers to be an IndexedSeq · 7f779e74
      Imran Rashid authored
      ## What changes were proposed in this pull request?
      
      The Seq[WorkerOffer] is accessed by index, so it really should be an IndexedSeq; otherwise an O(n) operation becomes O(n^2). In practice this hasn't been an issue, because where these offers are generated, the call to `.toSeq` just happens to create an IndexedSeq anyway. I got bitten by this in performance tests I was doing, and it's better for the types to be more precise so that, e.g., a change in Scala doesn't destroy performance.
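      
      For illustration, an index-driven loop like the one below is O(n^2) over a `List` but O(n) over an `IndexedSeq` such as `Vector` or `ArrayBuffer`:
      
      ```scala
      // Each offers(i) is O(i) on a linked List but effectively O(1) on an IndexedSeq.
      def sumCores(offers: Seq[Int]): Int = {
        var total = 0
        var i = 0
        while (i < offers.length) {
          total += offers(i)
          i += 1
        }
        total
      }
      ```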
      
      ## How was this patch tested?
      
      Unit tests via jenkins.
      
      Author: Imran Rashid <irashid@cloudera.com>
      
      Closes #15221 from squito/SPARK-17648.
      7f779e74
  21. Sep 28, 2016
    • Weiqing Yang's avatar
      [SPARK-17710][HOTFIX] Fix ClassCircularityError in ReplSuite tests in Maven... · 7dfad4b1
      Weiqing Yang authored
      [SPARK-17710][HOTFIX] Fix ClassCircularityError in ReplSuite tests in Maven build: use 'Class.forName' instead of 'Utils.classForName'
      
      ## What changes were proposed in this pull request?
      Fix ClassCircularityError in ReplSuite tests when Spark is built by Maven build.
      
      ## How was this patch tested?
      (1)
      ```
      build/mvn -DskipTests -Phadoop-2.3 -Pyarn -Phive -Phive-thriftserver -Pkinesis-asl -Pmesos clean package
      ```
      Then test:
      ```
      build/mvn -Dtest=none -DwildcardSuites=org.apache.spark.repl.ReplSuite test
      ```
      ReplSuite tests passed
      
      (2)
      Manual Tests against some Spark applications in Yarn client mode and Yarn cluster mode. Need to check if spark caller contexts are written into HDFS hdfs-audit.log and Yarn RM audit log successfully.
      
      Author: Weiqing Yang <yangweiqing001@gmail.com>
      
      Closes #15286 from Sherry302/SPARK-16757.
      7dfad4b1