  1. Aug 25, 2015
    • [SPARK-9613] [CORE] Ban use of JavaConversions and migrate all existing uses to JavaConverters · 69c9c177
      Sean Owen authored
      Replace `JavaConversions` implicits with `JavaConverters`
      
      Most occurrences I've seen so far are necessary conversions; a few have been avoidable. None are in critical code as far as I see, yet.
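      For illustration, a minimal sketch of the difference (hypothetical values): `JavaConverters` makes every conversion an explicit `.asScala`/`.asJava` call, whereas `JavaConversions` applied such conversions implicitly.

      ```
      import scala.collection.JavaConverters._

      val javaList = new java.util.ArrayList[String]()
      javaList.add("spark")

      // Explicit, visible conversions instead of implicit ones:
      val asScala: scala.collection.mutable.Buffer[String] = javaList.asScala
      val backToJava: java.util.List[String] = asScala.asJava
      ```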
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #8033 from srowen/SPARK-9613.
    • Fixed a typo in DAGScheduler. · 7f1e507b
      ehnalis authored
      Author: ehnalis <zoltan.zvara@gmail.com>
      
      Closes #8308 from ehnalis/master.
    • [DOC] add missing parameters in SparkContext.scala for scala doc · 5c148901
      Zhang, Liye authored
      Author: Zhang, Liye <liye.zhang@intel.com>
      
      Closes #8412 from liyezhang556520/minorDoc.
    • [SPARK-10197] [SQL] Add null check in wrapperFor (inside HiveInspectors). · 0e6368ff
      Yin Huai authored
      https://issues.apache.org/jira/browse/SPARK-10197
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #8407 from yhuai/ORCSPARK-10197.
    • [SPARK-10195] [SQL] Data sources Filter should not expose internal types · 7bc9a8c6
      Josh Rosen authored
      Spark SQL's data sources API exposes Catalyst's internal types through its Filter interfaces. This is a problem because types like UTF8String are not stable developer APIs and should not be exposed to third-parties.
      
      This issue caused incompatibilities when upgrading our `spark-redshift` library to work against Spark 1.5.0.  To avoid these issues in the future we should only expose public types through these Filter objects. This patch accomplishes this by using CatalystTypeConverters to add the appropriate conversions.
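      As a rough sketch of the idea (assumed usage; `CatalystTypeConverters.createToScalaConverter` is the converter factory the patch leans on, though the exact call sites differ): a string filter value is converted to a public type before being exposed.

      ```
      import org.apache.spark.sql.catalyst.CatalystTypeConverters
      import org.apache.spark.sql.types.StringType
      import org.apache.spark.unsafe.types.UTF8String

      // Internal representation that must not leak through the Filter API
      val internal: Any = UTF8String.fromString("bob")

      // Convert to the public type (java.lang.String) before building the Filter
      val toScala = CatalystTypeConverters.createToScalaConverter(StringType)
      val external: Any = toScala(internal)  // "bob": String
      ```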
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #8403 from JoshRosen/datasources-internal-vs-external-types.
    • [SPARK-10177] [SQL] fix reading Timestamp in parquet from Hive · 2f493f7e
      Davies Liu authored
      We misunderstood how the Julian day and nanosecond-of-day fields of Parquet timestamps (TimestampType) written by Hive/Impala relate: the two overlap, so they cannot simply be added together.

      To avoid confusing rounding during the conversion, we use `2440588` as the Julian Day of the Unix epoch (strictly it is 2440587.5, since Julian days begin at noon).
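      A worked sketch of the arithmetic (the helper name is hypothetical; the constant is the one adopted here): treating the epoch as Julian day `2440588`, with the time of day counted separately from midnight, keeps the conversion free of half-day rounding.

      ```
      val JULIAN_DAY_OF_EPOCH = 2440588L  // 1970-01-01, nanoseconds counted from midnight
      val MICROS_PER_DAY = 24L * 60 * 60 * 1000 * 1000

      // Hypothetical helper mirroring the conversion described above
      def julianDayToMicros(julianDay: Int, nanosOfDay: Long): Long =
        (julianDay - JULIAN_DAY_OF_EPOCH) * MICROS_PER_DAY + nanosOfDay / 1000

      // Julian day 2440588 at 0 nanoseconds => 0 microseconds since the Unix epoch
      assert(julianDayToMicros(2440588, 0L) == 0L)
      ```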
      
      Author: Davies Liu <davies@databricks.com>
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #8400 from davies/timestamp_parquet.
    • [SPARK-10210] [STREAMING] Filter out non-existent blocks before creating BlockRDD · 1fc37581
      Tathagata Das authored
      When the write-ahead log is not enabled, a recovered streaming driver still tries to run jobs using pre-failure block ids, and fails because those blocks no longer exist in memory (and cannot be recovered, as the receiver WAL is not enabled).
      
      This occurs because the driver-side WAL of ReceivedBlockTracker recovers that past block information, and ReceiverInputDStream creates BlockRDDs even if those blocks do not exist.
      
      The solution in this PR is to filter out block ids that do not exist before creating the BlockRDD. In addition, it adds unit tests to verify other logic in ReceiverInputDStream.
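      A hedged sketch of that filtering step (the helper and callback names are assumptions, not the patch's actual code):

      ```
      import org.apache.spark.storage.BlockId

      // Keep only the block ids the driver still knows about before building a BlockRDD
      def filterExistingBlocks(
          blockIds: Seq[BlockId],
          blockExists: BlockId => Boolean): Seq[BlockId] = {
        val valid = blockIds.filter(blockExists)
        if (valid.size < blockIds.size) {
          println(s"Ignoring ${blockIds.size - valid.size} block(s) that no longer exist")
        }
        valid
      }
      ```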
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #8405 from tdas/SPARK-10210.
    • [SPARK-6196] [BUILD] Remove MapR profiles in favor of hadoop-provided · 57b960bf
      Sean Owen authored
      Follow up to https://github.com/apache/spark/pull/7047
      
      pwendell mentioned that MapR should use `hadoop-provided` now, and indeed the new build script does not produce `mapr3`/`mapr4` artifacts anymore. Hence the action seems to be to remove the profiles, which are now not used.
      
      CC trystanleftwich
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #8338 from srowen/SPARK-6196.
    • [SPARK-10214] [SPARKR] [DOCS] Improve SparkR Column, DataFrame API docs · d4549fe5
      Yu ISHIKAWA authored
      cc: shivaram
      
      ## Summary
      
      - Add name tags to each method in DataFrame.R and column.R
      - Replace `rdname column` with `rdname {each_func}`, e.g. for the alias method: `rdname column` => `rdname alias`
      
      ## Generated PDF File
      https://drive.google.com/file/d/0B9biIZIU47lLNHN2aFpnQXlSeGs/view?usp=sharing
      
      ## JIRA
      [[SPARK-10214] Improve SparkR Column, DataFrame API docs - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-10214)
      
      Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
      
      Closes #8414 from yu-iskw/SPARK-10214.
    • [SPARK-9293] [SPARK-9813] Analysis should check that set operations are only performed on tables with equal numbers of columns · 82268f07
      Josh Rosen authored
      
      This patch adds an analyzer rule to ensure that set operations (union, intersect, and except) are only applied to tables with the same number of columns. Without this rule, there are scenarios where invalid queries can return incorrect results instead of failing with error messages; SPARK-9813 provides one example of this problem. In other cases, the invalid query can crash at runtime with extremely confusing exceptions.
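      The shape of the new check is roughly the following (a simplified sketch; the placement and naming here are illustrative, not the patch's actual code):

      ```
      import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

      // A set operation is only valid when both sides produce the same number of columns
      def checkSameArity(op: String, left: LogicalPlan, right: LogicalPlan): Unit =
        require(left.output.size == right.output.size,
          s"$op can only be performed on tables with the same number of columns " +
          s"(${left.output.size} vs ${right.output.size})")
      ```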
      
      I also performed a bit of cleanup to refactor some of those logical operators' code into a common `SetOperation` base class.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #7631 from JoshRosen/SPARK-9293.
    • [SPARK-10136] [SQL] A more robust fix for SPARK-10136 · bf03fe68
      Cheng Lian authored
      PR #8341 is a valid fix for SPARK-10136, but it didn't catch the real root cause.  The real problem can be rather tricky to explain, and requires the audience to be pretty familiar with the parquet-format spec, especially the details of the `LIST` backwards-compatibility rules.  Let me try to give an explanation here.
      
      The structure of the problematic Parquet schema generated by parquet-avro is something like this:
      
      ```
      message m {
        <repetition> group f (LIST) {         // Level 1
          repeated group array (LIST) {       // Level 2
            repeated <primitive-type> array;  // Level 3
          }
        }
      }
      ```
      
      (The schema generated by parquet-thrift is structurally similar, just replace the `array` at level 2 with `f_tuple`, and the other one at level 3 with `f_tuple_tuple`.)
      
      This structure consists of two nested legacy 2-level `LIST`-like structures:
      
      1. The repeated group type at level 2 is the element type of the outer array defined at level 1
      
         This group should map to a `CatalystArrayConverter.ElementConverter` when building converters.
      
      2. The repeated primitive type at level 3 is the element type of the inner array defined at level 2
      
         This type should also map to a `CatalystArrayConverter.ElementConverter`.
      
      The root cause of SPARK-10136 is that the group at level 2 isn't properly recognized as the element type of level 1.  Thus, according to the parquet-format spec, the repeated primitive at level 3 is left as a so-called "unannotated repeated primitive type", and is recognized as a required list of required primitive type; thus a `RepeatedPrimitiveConverter` instead of a `CatalystArrayConverter.ElementConverter` is created for it.
      
      According to the parquet-format spec, an unannotated repeated type shouldn't appear in a `LIST`- or `MAP`-annotated group.  PR #8341 fixed this issue by allowing such unannotated repeated types to appear in `LIST`-annotated groups, which is a non-standard, hacky, but valid fix.  (I didn't realize this when authoring #8341 though.)
      
      As for the reason why level 2 isn't recognized as a list element type, it's because of the following `LIST` backwards-compatibility rule defined in the parquet-format spec:
      
      > If the repeated field is a group with one field and is named either `array` or uses the `LIST`-annotated group's name with `_tuple` appended then the repeated type is the element type and elements are required.
      
      (The `array` part is for parquet-avro compatibility, while the `_tuple` part is for parquet-thrift.)
      
      This rule is implemented in [`CatalystSchemaConverter.isElementType`] [1], but neglected in [`CatalystRowConverter.isElementType`] [2].  This PR delivers a more robust fix by adding this rule in the latter method.
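      In code, the rule (together with the other compatibility cases) looks roughly like the following simplified sketch; the real `isElementType` handles more cases:

      ```
      import org.apache.parquet.schema.Type

      // Simplified sketch of the LIST element-type detection described above
      def isElementType(repeatedType: Type, parentName: String): Boolean = {
        repeatedType.isPrimitive ||                        // unannotated repeated primitive
        repeatedType.asGroupType().getFieldCount > 1 ||    // multi-field group: a struct element
        repeatedType.getName == "array" ||                 // parquet-avro style
        repeatedType.getName == s"${parentName}_tuple"     // parquet-thrift style
      }
      ```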
      
      Note that parquet-avro 1.7.0 also suffers from this issue. Details can be found at [PARQUET-364] [3].
      
      [1]: https://github.com/apache/spark/blob/85f9a61357994da5023b08b0a8a2eb09388ce7f8/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystSchemaConverter.scala#L259-L305
      [2]: https://github.com/apache/spark/blob/85f9a61357994da5023b08b0a8a2eb09388ce7f8/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystRowConverter.scala#L456-L463
      [3]: https://issues.apache.org/jira/browse/PARQUET-364
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #8361 from liancheng/spark-10136/proper-version.
    • [SPARK-10196] [SQL] Correctly saving decimals in internal rows to JSON. · df7041d0
      Yin Huai authored
      https://issues.apache.org/jira/browse/SPARK-10196
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #8408 from yhuai/DecimalJsonSPARK-10196.
    • [SPARK-10137] [STREAMING] Avoid restarting receivers if scheduleReceivers returns balanced results · f023aa2f
      zsxwing authored
      This PR fixes the following cases for `ReceiverSchedulingPolicy`.
      
      1) Assume there are 4 executors: host1, host2, host3, host4, and 5 receivers: r1, r2, r3, r4, r5. Then `ReceiverSchedulingPolicy.scheduleReceivers` will return (r1 -> host1, r2 -> host2, r3 -> host3, r4 -> host4, r5 -> host1).
      Let's assume r1 starts first on `host1` as `scheduleReceivers` suggested, and tries to register with ReceiverTracker. But the previous `ReceiverSchedulingPolicy.rescheduleReceiver` will return (host2, host3, host4) according to the current executor weights (host1 -> 1.0, host2 -> 0.5, host3 -> 0.5, host4 -> 0.5), so ReceiverTracker will reject `r1`. This is unexpected, since r1 is starting exactly where `scheduleReceivers` suggested.
      
      This case can be fixed by ignoring the information of the receiver that is being rescheduled in `receiverTrackingInfoMap`.
      
      2) Assume there are 3 executors (host1, host2, host3), each with 3 cores, and 3 receivers: r1, r2, r3. Assume r1 is running on host1. Now r2 is restarting; the previous `ReceiverSchedulingPolicy.rescheduleReceiver` will always return (host1, host2, host3). So it's possible that r2 will be scheduled to host1 by TaskScheduler. r3 is similar. In the end, it's possible that all 3 receivers are running on host1, while host2 and host3 are idle.
      
      This issue can be fixed by returning only the executors that have the minimum weight, rather than always returning at least 3 executors.
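      A minimal sketch of that second fix (names assumed): instead of padding the candidate list, return exactly the executors carrying the minimum weight.

      ```
      // Return only the least-loaded executors (empty input => empty result)
      def leastLoadedExecutors(weights: Map[String, Double]): Seq[String] =
        if (weights.isEmpty) Seq.empty
        else {
          val minWeight = weights.values.min
          weights.collect { case (executor, w) if w == minWeight => executor }.toSeq
        }
      ```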
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #8340 from zsxwing/fix-receiver-scheduling.
    • [SPARK-9786] [STREAMING] [KAFKA] fix backpressure so it works with default maxRatePerPartition setting of 0 · d9c25dec
      cody koeninger authored
      
      Author: cody koeninger <cody@koeninger.org>
      
      Closes #8413 from koeninger/backpressure-testing-master.
    • [SPARK-10178] [SQL] HiveComparisionTest should print out dependent tables · 5175ca0c
      Michael Armbrust authored
      In `HiveComparisionTest`s it is possible to fail a query of the form `SELECT * FROM dest1`, where `dest1` is the query that is actually computing the incorrect results.  To aid debugging, this patch improves the harness to also print these query plans and their results.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #8388 from marmbrus/generatedTables.
  2. Aug 24, 2015
    • [SPARK-10121] [SQL] Thrift server always uses the latest class loader provided by the conf of executionHive's state · a0c0aae1
      Yin Huai authored
      
      https://issues.apache.org/jira/browse/SPARK-10121
      
      Looks like the problem is that if we add a jar through another thread, the thread handling the JDBC session will not get the latest classloader.
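      The underlying idea, as a hedged sketch (the helper is hypothetical, not the patch itself): install the latest class loader on the session-handling thread before running its work.

      ```
      // Run `body` with `latest` as the thread's context class loader, then restore it
      def withClassLoader[A](latest: ClassLoader)(body: => A): A = {
        val original = Thread.currentThread().getContextClassLoader
        Thread.currentThread().setContextClassLoader(latest)
        try body finally Thread.currentThread().setContextClassLoader(original)
      }
      ```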
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #8368 from yhuai/SPARK-10121.
    • [SQL] [MINOR] [DOC] Clarify docs for inferring DataFrame from RDD of Products · 642c43c8
      Feynman Liang authored
       * Makes `SQLImplicits.rddToDataFrameHolder` scaladoc consistent with `SQLContext.createDataFrame[A <: Product](rdd: RDD[A])` since the former is essentially a wrapper for the latter
       * Clarifies `createDataFrame[A <: Product]` scaladoc to apply to any `RDD[Product]`, not just case classes (see the sketch below)
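      A quick illustration of the clarified behavior (standard 1.5-era API; any `Product` works, tuples included):

      ```
      import org.apache.spark.{SparkConf, SparkContext}
      import org.apache.spark.sql.SQLContext

      val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("doc-example"))
      val sqlContext = new SQLContext(sc)

      case class Person(name: String, age: Int)

      // Case classes are Products...
      val df1 = sqlContext.createDataFrame(sc.parallelize(Seq(Person("alice", 30))))
      // ...but so are plain tuples
      val df2 = sqlContext.createDataFrame(sc.parallelize(Seq(("bob", 25))))
      ```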
      
      Author: Feynman Liang <fliang@databricks.com>
      
      Closes #8406 from feynmanliang/sql-doc-fixes.
    • [SPARK-10118] [SPARKR] [DOCS] Improve SparkR API docs for 1.5 release · 6511bf55
      Yu ISHIKAWA authored
      cc: shivaram
      
      ## Summary
      
      - Modify the `rdname` of expression functions, e.g. `ascii`: `rdname functions` => `rdname ascii`
      - Replace the dynamic function definitions with static ones for the sake of their documentation.
      
      ## Generated PDF File
      https://drive.google.com/file/d/0B9biIZIU47lLX2t6ZjRoRnBTSEU/view?usp=sharing
      
      ## JIRA
      [[SPARK-10118] Improve SparkR API docs for 1.5 release - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-10118)
      
      Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
      Author: Yuu ISHIKAWA <yuu.ishikawa@gmail.com>
      
      Closes #8386 from yu-iskw/SPARK-10118.
    • [SPARK-10165] [SQL] Await child resolution in ResolveFunctions · 2bf338c6
      Michael Armbrust authored
      Currently, we eagerly attempt to resolve functions, even before their children are resolved.  However, this is not valid in cases where we need to know the types of the input arguments (i.e. when resolving Hive UDFs).
      
      As a fix, this PR delays function resolution until the function's children are resolved.  This change also necessitates a change to the way we resolve aggregate expressions that are not in aggregate operators (e.g., in `HAVING` or `ORDER BY` clauses).  Specifically, we can no longer assume that these misplaced functions will be resolved, which previously allowed us to differentiate aggregate functions from normal functions.  To compensate for this change, we now attempt to resolve these unresolved expressions in the context of the aggregate operator, before checking to see if any aggregate expressions are present.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #8371 from marmbrus/hiveUDFResolution.
    • [SPARK-10190] Fix NPE in CatalystTypeConverters Decimal toScala converter · d7b4c095
      Josh Rosen authored
      This adds a missing null check to the Decimal `toScala` converter in `CatalystTypeConverters`, fixing an NPE.
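      The shape of the fix, as a minimal sketch (the actual converter lives inside `CatalystTypeConverters`; this standalone function is illustrative): the `toScala` path must pass `null` through instead of dereferencing it.

      ```
      import org.apache.spark.sql.types.Decimal

      // Null-safe conversion: previously a null Decimal triggered an NPE here
      def decimalToScala(d: Decimal): java.math.BigDecimal =
        if (d == null) null else d.toJavaBigDecimal
      ```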
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #8401 from JoshRosen/SPARK-10190.
    • [SPARK-10061] [DOC] ML ensemble docs · 13db11cb
      Joseph K. Bradley authored
      User guide for spark.ml GBTs and Random Forests.
      The examples are copied from the decision tree guide and modified to run.
      
      I caught some issues I had somehow missed in the tree guide as well.
      
      I have run all examples, including Java ones.  (Of course, I thought I had previously as well...)
      
      CC: mengxr manishamde yanboliang
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #8369 from jkbradley/ml-ensemble-docs.
    • [SPARK-9758] [TEST] [SQL] Compilation issue for hive test / wrong package? · cb2d2e15
      Sean Owen authored
      Move the `test.org.apache.spark.sql.hive` package tests to the apparently intended `org.apache.spark.sql.hive`, as they don't intend to test behavior from outside org.apache.spark.*
      
      Alternate take, per discussion at https://github.com/apache/spark/pull/8051
      I think this is what vanzin and I had in mind but also CC rxin to cross-check, as this does indeed depend on whether these tests were accidentally in this package or not. Testing from a `test.org.apache.spark` package is legitimate but didn't seem to be the intent here.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #8307 from srowen/SPARK-9758.
    • [SPARK-8580] [SQL] Refactors ParquetHiveCompatibilitySuite and adds more test cases · a2f4cdce
      Cheng Lian authored
      This PR refactors `ParquetHiveCompatibilitySuite` so that it's easier to add new test cases.
      
      Hit two bugs, SPARK-10177 and HIVE-11625, while working on this, added test cases for them and marked as ignored for now. SPARK-10177 will be addressed in a separate PR.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #8392 from liancheng/spark-8580/parquet-hive-compat-tests.
    • [SPARK-10144] [UI] Actually show peak execution memory by default · 662bb966
      Andrew Or authored
      The peak execution memory metric was introduced in SPARK-8735. That was before Tungsten was enabled by default, so it assumed that `spark.sql.unsafe.enabled` must be explicitly set to true. The result is that the memory is not displayed by default.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #8345 from andrewor14/show-memory-default.
    • [SPARK-7710] [SPARK-7998] [DOCS] Docs for DataFrameStatFunctions · 9ce0c7ad
      Burak Yavuz authored
      This PR contains examples on how to use some of the Stat Functions available for DataFrames under `df.stat`.
      
      rxin
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #8378 from brkyvz/update-sql-docs.
    • [SPARK-9791] [PACKAGE] Change private class to private[package] class to prevent unnecessary classes from showing up in the docs · 7478c8b6
      Tathagata Das authored
      
      In addition, some random cleanup of import ordering
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #8387 from tdas/SPARK-9791 and squashes the following commits:
      
      67f3ee9 [Tathagata Das] Change private class to private[package] class to prevent them from showing up in the docs
    • [SPARK-10168] [STREAMING] Fix the issue that maven publishes wrong artifact jars · 4e0395dd
      zsxwing authored
      This PR removed the `outputFile` configuration from pom.xml and updated `tests.py` to search jars for both sbt build and maven build.
      
      I ran `mvn -Pkinesis-asl -DskipTests clean install` locally, and verified that the jars in my local repository were correct. I also checked Python tests for the maven build, and it passed all tests.
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #8373 from zsxwing/SPARK-10168 and squashes the following commits:
      
      e0b5818 [zsxwing] Fix the sbt build
      c697627 [zsxwing] Add the jar pathes to the exception message
      be1d8a5 [zsxwing] Fix the issue that maven publishes wrong artifact jars
  3. Aug 23, 2015
    • [SPARK-10142] [STREAMING] Made python checkpoint recovery handle non-local checkpoint paths and existing SparkContexts · 053d94fc
      Tathagata Das authored
      
      The current code only checks checkpoint files in the local filesystem, and always tries to create a new Python SparkContext (even if one already exists). The solution is to do the following:
      1. Use the same code path as Java to check whether a valid checkpoint exists
      2. Create a new Python SparkContext only if there is no active one.

      There is no test for this path, as it's hard to test with distributed filesystem paths in a local unit test. I am going to test it manually with a distributed file system to verify that this patch works.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #8366 from tdas/SPARK-10142 and squashes the following commits:
      
      3afa666 [Tathagata Das] Added tests
      2dd4ae5 [Tathagata Das] Added the check to not create a context if one already exists
      9bf151b [Tathagata Das] Made python checkpoint recovery use java to find the checkpoint files
    • [SPARK-10164] [MLLIB] Fixed GMM distributed decomposition bug · b963c19a
      Joseph K. Bradley authored
      GaussianMixture now distributes matrix decompositions for certain problem sizes. The distributed computation was actually failing, but this was not covered by unit tests.
      
      This PR adds a unit test which checks this.  It failed previously but works with this fix.
      
      CC: mengxr
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #8370 from jkbradley/gmm-fix.
    • [SPARK-10148] [STREAMING] Display active and inactive receiver numbers in Streaming page · c6df5f66
      zsxwing authored
      Added the active and inactive receiver numbers to the summary section of the Streaming page.
      
      <img width="1074" alt="screen shot 2015-08-21 at 2 08 54 pm" src="https://cloud.githubusercontent.com/assets/1000778/9402437/ff2806a2-480f-11e5-8f8e-efdf8e5d514d.png">
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #8351 from zsxwing/receiver-number.
    • Update streaming-programming-guide.md · 623c675f
      Keiji Yoshida authored
      Update `See the Scala example` to `See the Java example`.
      
      Author: Keiji Yoshida <yoshida.keiji.84@gmail.com>
      
      Closes #8376 from yosssi/patch-1.
  4. Aug 22, 2015
  5. Aug 21, 2015
    • [SPARK-9893] User guide with Java test suite for VectorSlicer · 630a994e
      Xusen Yin authored
      Add a user guide for `VectorSlicer`, with a Java test suite and a Python version of VectorSlicer.

      Note that the Python version does not support selecting by name yet.
      
      Author: Xusen Yin <yinxusen@gmail.com>
      
      Closes #8267 from yinxusen/SPARK-9893.
    • [SPARK-10163] [ML] Allow single-category features for GBT models · f01c4220
      Joseph K. Bradley authored
      Removed categorical feature info validation, since it is no longer needed.
      
      This is needed to make the ML user guide examples work (in another current PR).
      
      CC: mengxr
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #8367 from jkbradley/gbt-single-cat.
    • [SPARK-10143] [SQL] Use parquet's block size (row group size) setting as the min split size if necessary · e3355090
      Yin Huai authored
      
      https://issues.apache.org/jira/browse/SPARK-10143
      
      With this PR, we set the min split size to parquet's block size (row group size) from the conf whenever the min split size is smaller. This way, we avoid creating too many tasks, and even useless tasks, when reading parquet data.
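      A hedged sketch of the adjustment (the Hadoop and Parquet config keys are the standard ones, but the exact wiring in the patch may differ):

      ```
      import org.apache.hadoop.conf.Configuration

      // Raise the min split size to the configured parquet row group size,
      // so that a split never covers less than one row group.
      def ensureMinSplitSize(conf: Configuration): Unit = {
        val parquetBlockSize = conf.getLong("parquet.block.size", 128L * 1024 * 1024)
        val minSplitKey = "mapreduce.input.fileinputformat.split.minsize"
        if (conf.getLong(minSplitKey, 0L) < parquetBlockSize) {
          conf.setLong(minSplitKey, parquetBlockSize)
        }
      }
      ```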
      
      I tested it locally. The table I have is 343MB and lives in my local FS. Because I did not set any min/max split size, the default split size was 32MB and the map stage had 11 tasks, but only three of those tasks actually read data. With my PR, there were only three tasks in the map stage. Here is the difference.
      
      Without this PR:
      ![image](https://cloud.githubusercontent.com/assets/2072857/9399179/8587dba6-4765-11e5-9189-7ebba52a2b6d.png)
      
      With this PR:
      ![image](https://cloud.githubusercontent.com/assets/2072857/9399185/a4735d74-4765-11e5-8848-1f1e361a6b4b.png)
      
      Even if the block size setting does not match the actual block size of the parquet file, I think it is still generally good to use parquet's block size setting if the min split size is smaller than this block size.
      
      Tested it on a cluster using
      ```
      val count = sqlContext.table("""store_sales""").groupBy().count().queryExecution.executedPlan(3).execute().count
      ```
      Basically, it reads 0 columns of table `store_sales`. My table has 1824 parquet files with sizes from 80MB to 280MB (1 to 3 row groups). Without this patch, on a 16 worker cluster, the job had 5023 tasks and took 102s. With this patch, the job had 2893 tasks and took 64s. It is still not as good as using one mapper per file (1824 tasks and 42s), but it is much better than master.
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #8346 from yhuai/parquetMinSplit.
    • [SPARK-9864] [DOC] [MLlib] [SQL] Replace since in scaladoc to Since annotation · f5b028ed
      MechCoder authored
      Author: MechCoder <manojkumarsivaraj334@gmail.com>
      
      Closes #8352 from MechCoder/since.
    • [SPARK-10122] [PYSPARK] [STREAMING] Fix getOffsetRanges bug in PySpark-Streaming transform function · d89cc38b
      jerryshao authored
      Details of the bug and explanations can be seen in [SPARK-10122](https://issues.apache.org/jira/browse/SPARK-10122).
      
      tdas, please help to review.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #8347 from jerryshao/SPARK-10122 and squashes the following commits:
      
      4039b16 [jerryshao] Fix getOffsetRanges in transform() bug
    • [SPARK-10130] [SQL] type coercion for IF should have children resolved first · 3c462f5d
      Daoyuan Wang authored
      Type coercion for IF should have its children resolved first, or we could hit an unresolved exception.
      
      Author: Daoyuan Wang <daoyuan.wang@intel.com>
      
      Closes #8331 from adrian-wang/spark10130.
    • [SPARK-9439] [YARN] External shuffle service robust to NM restarts using leveldb · 708036c1
      Imran Rashid authored
      https://issues.apache.org/jira/browse/SPARK-9439
      
      In general, Yarn apps should be robust to NodeManager restarts.  However, if you run spark with the external shuffle service on, after a NM restart all shuffles fail, b/c the shuffle service has lost some state with info on each executor.  (Note the shuffle data is perfectly fine on disk across a NM restart, the problem is we've lost the small bit of state that lets us *find* those files.)
      
      The solution proposed here is that the external shuffle service can write out its state to leveldb (backed by a local file) every time an executor is added.  When running with yarn, that file is in the NM's local dir.  Whenever the service is started, it looks for that file, and if it exists, it reads the file and re-registers all executors there.
      
      Nothing is changed in non-yarn modes with this patch.  The service is not given a place to save the state to, so it operates the same as before.  This should make it easy to update other cluster managers as well, by just supplying the right file & the equivalent of yarn's `initializeApplication` -- I'm not familiar enough with those modes to know how to do that.
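      To make the mechanism concrete, a hedged sketch using the leveldbjni API the service builds on (the key/value layout here is illustrative, not the actual on-disk format):

      ```
      import java.io.File
      import org.fusesource.leveldbjni.JniDBFactory
      import org.iq80.leveldb.Options

      val db = JniDBFactory.factory.open(
        new File("registeredExecutors.ldb"), new Options().createIfMissing(true))

      // On executor registration: persist (appId, execId) -> serialized executor info
      db.put("app-1:exec-1".getBytes("UTF-8"),
        """{"localDirs":["/tmp/blockmgr-0"]}""".getBytes("UTF-8"))

      // After an NM restart: replay the file and re-register every executor found
      val it = db.iterator()
      it.seekToFirst()
      while (it.hasNext) {
        val entry = it.next()
        // re-register the executor from entry.getKey / entry.getValue
      }
      it.close()
      db.close()
      ```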
      
      Author: Imran Rashid <irashid@cloudera.com>
      
      Closes #7943 from squito/leveldb_external_shuffle_service_NM_restart and squashes the following commits:
      
      0d285d3 [Imran Rashid] review feedback
      70951d6 [Imran Rashid] Merge branch 'master' into leveldb_external_shuffle_service_NM_restart
      5c71c8c [Imran Rashid] save executor to db before registering; style
      2499c8c [Imran Rashid] explicit dependency on jackson-annotations
      795d28f [Imran Rashid] review feedback
      81f80e2 [Imran Rashid] Merge branch 'master' into leveldb_external_shuffle_service_NM_restart
      594d520 [Imran Rashid] use json to serialize application executor info
      1a7980b [Imran Rashid] version
      8267d2a [Imran Rashid] style
      e9f99e8 [Imran Rashid] cleanup the handling of bad dbs a little
      9378ba3 [Imran Rashid] fail gracefully on corrupt leveldb files
      acedb62 [Imran Rashid] switch to writing out one record per executor
      79922b7 [Imran Rashid] rely on yarn to call stopApplication; assorted cleanup
      12b6a35 [Imran Rashid] save registered executors when apps are removed; add tests
      c878fbe [Imran Rashid] better explanation of shuffle service port handling
      694934c [Imran Rashid] only open leveldb connection once per service
      d596410 [Imran Rashid] store executor data in leveldb
      59800b7 [Imran Rashid] Files.move in case renaming is unsupported
      32fe5ae [Imran Rashid] Merge branch 'master' into external_shuffle_service_NM_restart
      d7450f0 [Imran Rashid] style
      f729e2b [Imran Rashid] debugging
      4492835 [Imran Rashid] lol, dont use a PrintWriter b/c of scalastyle checks
      0a39b98 [Imran Rashid] Merge branch 'master' into external_shuffle_service_NM_restart
      55f49fc [Imran Rashid] make sure the service doesnt die if the registered executor file is corrupt; add tests
      245db19 [Imran Rashid] style
      62586a6 [Imran Rashid] just serialize the whole executors map
      bdbbf0d [Imran Rashid] comments, remove some unnecessary changes
      857331a [Imran Rashid] better tests & comments
      bb9d1e6 [Imran Rashid] formatting
      bdc4b32 [Imran Rashid] rename
      86e0cb9 [Imran Rashid] for tests, shuffle service finds an open port
      23994ff [Imran Rashid] style
      7504de8 [Imran Rashid] style
      a36729c [Imran Rashid] cleanup
      efb6195 [Imran Rashid] proper unit test, and no longer leak if apps stop during NM restart
      dd93dc0 [Imran Rashid] test for shuffle service w/ NM restarts
      d596969 [Imran Rashid] cleanup imports
      0e9d69b [Imran Rashid] better names
      9eae119 [Imran Rashid] cleanup lots of duplication
      1136f44 [Imran Rashid] test needs to have an actual shuffle
      0b588bd [Imran Rashid] more fixes ...
      ad122ef [Imran Rashid] more fixes
      5e5a7c3 [Imran Rashid] fix build
      c69f46b [Imran Rashid] maybe working version, needs tests & cleanup ...
      bb3ba49 [Imran Rashid] minor cleanup
      36127d3 [Imran Rashid] wip
      b9d2ced [Imran Rashid] incomplete setup for external shuffle service tests