  1. Nov 02, 2015
  2. Nov 01, 2015
    • [SPARK-9298][SQL] Add pearson correlation aggregation function · 3e770a64
      Liang-Chi Hsieh authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-9298
      
      This patch adds a Pearson correlation aggregation function based on `AggregateExpression2`.
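
As a rough illustration of what a correlation aggregate has to maintain, the sketch below keeps running means and co-moments in a small buffer that can be updated row by row and merged across partitions. The names `CorrState`, `update`, `merge`, and `evaluate` are assumptions made for this example, not the patch's actual API, and the patch's buffer layout may differ.

```python
import math
from dataclasses import dataclass


@dataclass
class CorrState:
    """Hypothetical aggregation buffer: count, means, and (co-)moments."""
    n: int = 0
    x_avg: float = 0.0
    y_avg: float = 0.0
    ck: float = 0.0   # sum of (x - x_avg) * (y - y_avg)
    m2x: float = 0.0  # sum of (x - x_avg) ** 2
    m2y: float = 0.0  # sum of (y - y_avg) ** 2


def update(s, x, y):
    """Fold one (x, y) pair into the buffer using a single-pass update."""
    n = s.n + 1
    dx = x - s.x_avg
    dy = y - s.y_avg
    x_avg = s.x_avg + dx / n
    y_avg = s.y_avg + dy / n
    return CorrState(n, x_avg, y_avg,
                     s.ck + dx * (y - y_avg),
                     s.m2x + dx * (x - x_avg),
                     s.m2y + dy * (y - y_avg))


def merge(a, b):
    """Combine two partial buffers, e.g. from two partitions."""
    if a.n == 0:
        return b
    if b.n == 0:
        return a
    n = a.n + b.n
    dx = b.x_avg - a.x_avg
    dy = b.y_avg - a.y_avg
    w = a.n * b.n / n
    return CorrState(n,
                     a.x_avg + dx * b.n / n,
                     a.y_avg + dy * b.n / n,
                     a.ck + b.ck + dx * dy * w,
                     a.m2x + b.m2x + dx * dx * w,
                     a.m2y + b.m2y + dy * dy * w)


def evaluate(s):
    """Return the Pearson correlation, or NaN when it is undefined."""
    denom = math.sqrt(s.m2x * s.m2y)
    return s.ck / denom if denom > 0 else float("nan")
```

Merging two partial buffers should give the same result as feeding every row through `update` in sequence, which is the property a distributed aggregate relies on.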
      
      Author: Liang-Chi Hsieh <viirya@appier.com>
      
      Closes #8587 from viirya/corr_aggregation.
    • [SPARK-11073][CORE][YARN] Remove akka dependency in secret key generation. · f8d93ede
      Marcelo Vanzin authored
      Use standard JDK APIs for that (with a little help from Guava). Most of the
      changes here are in test code, since there were no tests specific to that
      part of the code.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #9257 from vanzin/SPARK-11073.
    • [SPARK-11020][CORE] Wait for HDFS to leave safe mode before initializing HS. · cf04fdfe
      Marcelo Vanzin authored
      Large HDFS clusters may take a while to leave safe mode when starting; this change
      makes the HS wait for that before doing checks about its configuration. This means
      the HS won't stop right away if HDFS is in safe mode and the configuration is not
      correct, but that should be a very uncommon situation.
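
The waiting behaviour described above can be sketched as a simple polling loop. This is only an illustration: `wait_for_safe_mode_exit`, the `is_in_safe_mode` callable, and the polling interval are hypothetical names chosen for the example, not the history server's actual code.

```python
import time


def wait_for_safe_mode_exit(is_in_safe_mode, poll_interval_s=5.0, sleep=time.sleep):
    """Block until is_in_safe_mode() returns False.

    is_in_safe_mode is a caller-supplied callable (an assumption of this
    sketch) that asks the file system whether it is still in safe mode.
    Returns how many polls observed safe mode before it cleared.
    """
    waits = 0
    while is_in_safe_mode():
        waits += 1
        sleep(poll_interval_s)
    return waits


def check_configuration_after_safe_mode(is_in_safe_mode, check_config, sleep=time.sleep):
    """Run configuration checks only once the file system has left safe mode."""
    wait_for_safe_mode_exit(is_in_safe_mode, sleep=sleep)
    return check_config()
```

The trade-off is the one the commit message names: a misconfigured server no longer fails immediately while safe mode is active, because the checks are deferred until the loop exits.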
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #9043 from vanzin/SPARK-11020.
    • [SPARK-11410][SQL] Add APIs to provide functionality similar to Hive's DISTRIBUTE BY and SORT BY. · 046e32ed
      Nong Li authored
      DISTRIBUTE BY allows the user to hash partition the data by specified exprs. It also allows for
      optional sorting within each resulting partition. There is no required relationship between the
      exprs for partitioning and sorting (i.e. one does not need to be a prefix of the other).
      
      This patch adds two APIs to DataFrames which can be used together to provide this functionality:
        1. distributeBy() which partitions the data frame into a specified number of partitions using the
           partitioning exprs.
        2. localSort() which sorts each partition using the provided sorting exprs.
      
      To get the DISTRIBUTE BY functionality, the user simply does: df.distributeBy(...).localSort(...)
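
To make the semantics concrete, here is a small plain-Python model of the two steps: hash partitioning by one expression, then sorting inside each partition by another. It does not use Spark; `distribute_by` and `local_sort` are stand-in functions written for this sketch, and integer keys are used so that Python's `hash` is deterministic.

```python
def distribute_by(rows, num_partitions, key):
    """Hash-partition rows: rows with equal keys land in the same partition."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        partitions[hash(key(row)) % num_partitions].append(row)
    return partitions


def local_sort(partitions, key):
    """Sort each partition independently; no ordering across partitions."""
    return [sorted(partition, key=key) for partition in partitions]


rows = [
    {"dept": 1, "salary": 300},
    {"dept": 2, "salary": 100},
    {"dept": 1, "salary": 100},
    {"dept": 3, "salary": 50},
    {"dept": 2, "salary": 900},
]

# Partition by department, then sort by salary within each partition.
# The sort key need not be related to the partitioning key.
result = local_sort(distribute_by(rows, 2, key=lambda r: r["dept"]),
                    key=lambda r: r["salary"])
```

Each inner list of `result` is ordered by salary, but there is no global order across the lists, which is the distinction between a per-partition sort and a total sort.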
      
      Author: Nong Li <nongli@gmail.com>
      
      Closes #9364 from nongli/spark-11410.
    • [SPARK-11338] [WEBUI] Prepend app links on HistoryPage with uiRoot path · dc7e399f
      Christian Kadner authored
      [SPARK-11338: HistoryPage not multi-tenancy enabled ...](https://issues.apache.org/jira/browse/SPARK-11338)
      - `HistoryPage.scala` ...prepending all page links with the web proxy (`uiRoot`) path
      - `HistoryServerSuite.scala` ...adding a test case to verify all site-relative links are prefixed when the environment variable `APPLICATION_WEB_PROXY_BASE` (or System property `spark.ui.proxyBase`) is set
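
The idea of prefixing site-relative links with a proxy base can be sketched as below. This is a simplified stand-in, not Spark's helper: `ui_root_from` and `prepend_base_uri` are names invented for the example, and the precedence between the environment variable and the system property is an assumption of this sketch.

```python
def ui_root_from(env, props):
    """Pick the proxy base from an env-like and a properties-like mapping.

    The lookup order (environment variable first) is an assumption made
    for this sketch.
    """
    return env.get("APPLICATION_WEB_PROXY_BASE") or props.get("spark.ui.proxyBase") or ""


def prepend_base_uri(ui_root, resource):
    """Prefix a site-relative link with the proxy base, avoiding double slashes."""
    return ui_root.rstrip("/") + "/" + resource.lstrip("/")


root = ui_root_from({"APPLICATION_WEB_PROXY_BASE": "/proxy/spark"}, {})
link = prepend_base_uri(root, "/history/app-1")
```

With no proxy base configured, the link is returned unchanged apart from a guaranteed leading slash.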
      
      Author: Christian Kadner <ckadner@us.ibm.com>
      
      Closes #9291 from ckadner/SPARK-11338 and squashes the following commits:
      
      01d2f35 [Christian Kadner] [SPARK-11338][WebUI] nit fixes
      d054bd7 [Christian Kadner] [SPARK-11338][WebUI] prependBaseUri in method makePageLink
      8bcb3dc [Christian Kadner] [SPARK-11338][WebUI] Prepend application links on HistoryPage with uiRoot path
    • [SPARK-11305][DOCS] Remove Third-Party Hadoop Distributions Doc Page · 643c49c7
      Sean Owen authored
      Remove Hadoop third party distro page, and move Hadoop cluster config info to configuration page
      
      CC pwendell
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #9298 from srowen/SPARK-11305.
  3. Oct 31, 2015
    • [SPARK-11117] [SPARK-11345] [SQL] Makes all HadoopFsRelation data sources produce UnsafeRow · aa494a9c
      Cheng Lian authored
      This PR fixes two issues:
      
      1.  `PhysicalRDD.outputsUnsafeRows` is always `false`
      
          Thus a `ConvertToUnsafe` operator is often required even if the underlying data source relation does output `UnsafeRow`.
      
      2.  Internal/external row conversion for `HadoopFsRelation` is kinda messy
      
          Currently we're using `HadoopFsRelation.needConversion` and [dirty type erasure hacks][1] to indicate whether the relation outputs external row or internal row and apply external-to-internal conversion when necessary.  Basically, all builtin `HadoopFsRelation` data sources, i.e. Parquet, JSON, ORC, and Text output `InternalRow`, while typical external `HadoopFsRelation` data sources, e.g. spark-avro and spark-csv, output `Row`.
      
      This PR adds a `private[sql]` interface method `HadoopFsRelation.buildInternalScan`, which by default invokes `HadoopFsRelation.buildScan` and converts `Row`s to `UnsafeRow`s (which are also `InternalRow`s).  All builtin `HadoopFsRelation` data sources override this method and directly output `UnsafeRow`s.  In this way, now `HadoopFsRelation` always produces `UnsafeRow`s. Thus `PhysicalRDD.outputsUnsafeRows` can be properly set by checking whether the underlying data source is a `HadoopFsRelation`.
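
The shape of this change, a default method that converts external rows plus overrides that skip the conversion, can be modelled in a few lines. The classes below are toy stand-ins written for this sketch (tuples instead of real row types, hypothetical `Example*` relations); they only mirror the structure described in the paragraph above.

```python
class Row(tuple):
    """Toy stand-in for an external row."""


class UnsafeRow(tuple):
    """Toy stand-in for the internal binary row format."""


class HadoopFsRelation:
    def build_scan(self):
        """External data sources implement this and return Row objects."""
        raise NotImplementedError

    def build_internal_scan(self):
        """Default: call build_scan() and convert every Row to an UnsafeRow."""
        return [UnsafeRow(row) for row in self.build_scan()]


class ExampleExternalRelation(HadoopFsRelation):
    """Hypothetical third-party source: only knows about external rows."""

    def build_scan(self):
        return [Row((1, "a")), Row((2, "b"))]


class ExampleBuiltinRelation(HadoopFsRelation):
    """Hypothetical built-in source: overrides the internal scan directly."""

    def build_internal_scan(self):
        return [UnsafeRow((1, "a")), UnsafeRow((2, "b"))]


def outputs_unsafe_rows(relation):
    """Every HadoopFsRelation now yields UnsafeRow, so a type check suffices."""
    return isinstance(relation, HadoopFsRelation)
```

Either way the caller only sees `UnsafeRow`s, which is what lets the planner decide from the relation type alone.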
      
      A remaining question is whether we can assume that all non-builtin `HadoopFsRelation` data sources output external rows.  At least all well-known ones do so.  However, it's possible that some users implemented their own `HadoopFsRelation` data sources that leverage `InternalRow` and thus all those unstable internal data representations.  If this assumption is safe, we can deprecate `HadoopFsRelation.needConversion` and clean up some more conversion code (like [here][2] and [here][3]).
      
      This PR supersedes #9125.
      
      Follow-ups:
      
      1.  Makes JSON and ORC data sources output `UnsafeRow` directly
      
      2.  Makes `HiveTableScan` output `UnsafeRow` directly
      
          This is related to 1 since ORC data source shares the same `Writable` unwrapping code with `HiveTableScan`.
      
      [1]: https://github.com/apache/spark/blob/v1.5.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala#L353
      [2]: https://github.com/apache/spark/blob/v1.5.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L331-L335
      [3]: https://github.com/apache/spark/blob/v1.5.1/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala#L630-L669
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #9305 from liancheng/spark-11345.unsafe-hadoop-fs-relation.
    • [SPARK-11265][YARN] YarnClient can't get tokens to talk to Hive 1.2.1 in a secure cluster · 40d3c679
      Steve Loughran authored
      This is a fix for SPARK-11265: the introspection code that obtains Hive delegation tokens fails on Spark 1.5.1+ due to changes in the Hive codebase.
      
      Author: Steve Loughran <stevel@hortonworks.com>
      
      Closes #9232 from steveloughran/stevel/patches/SPARK-11265-hive-tokens.
    • [SPARK-11024][SQL] Optimize NULL in <inlist-expressions> by folding it to Literal(null) · fc27dfbf
      Dilip Biswal authored
      Add a rule in the optimizer to convert NULL [NOT] IN (expr1,...,expr2) to
      Literal(null).
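
A folding rule of this kind can be sketched on a tiny expression tree. The node classes and the `fold_null_in` function below are illustrative assumptions, not Catalyst's classes or the rule added by the patch; the sketch relies only on the SQL rule that a NULL left-hand side makes `IN` evaluate to NULL, and that `NOT NULL` is NULL.

```python
from dataclasses import dataclass
from typing import Any, Tuple


@dataclass(frozen=True)
class Literal:
    value: Any


@dataclass(frozen=True)
class Column:
    name: str


@dataclass(frozen=True)
class In:
    value: Any
    candidates: Tuple[Any, ...]


@dataclass(frozen=True)
class Not:
    child: Any


NULL = Literal(None)


def fold_null_in(expr):
    """Rewrite NULL IN (...) and NOT (NULL IN (...)) to a NULL literal."""
    if isinstance(expr, Not):
        child = fold_null_in(expr.child)
        # NOT NULL is NULL under three-valued logic.
        return NULL if child == NULL else Not(child)
    if isinstance(expr, In) and expr.value == NULL:
        # The candidate list never needs to be evaluated.
        return NULL
    return expr
```

Expressions whose left-hand side is not a NULL literal pass through unchanged, so the rule is safe to apply to every predicate.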
      
      This is a follow-up defect to SPARK-8654.
      
      cloud-fan, can you please take a look?
      
      Author: Dilip Biswal <dbiswal@us.ibm.com>
      
      Closes #9348 from dilipbiswal/spark_11024.
    • [SPARK-11424] Guard against double-close() of RecordReaders · ac4118db
      Josh Rosen authored
      **TL;DR**: We can rule out one rare but potential cause of input stream corruption via defensive programming.
      
      ## Background
      
      [MAPREDUCE-5918](https://issues.apache.org/jira/browse/MAPREDUCE-5918) is a bug where an instance of a decompressor ends up getting placed into a pool multiple times. Since the pool is backed by a list instead of a set, this can lead to the same decompressor being used in different places at the same time, which is not safe because those decompressors will overwrite each other's buffers. Sometimes this buffer sharing will lead to exceptions, but other times it might silently result in invalid / garbled input.
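
The failure mode can be reproduced with a toy pool. The classes below are invented for this sketch and are not Hadoop's pool implementation; they only show why a list-backed free list hands the same object to two borrowers after a double return.

```python
class Decompressor:
    """Toy stand-in holding a mutable buffer."""

    def __init__(self):
        self.buffer = bytearray()


class ListBackedPool:
    """Free list without a duplicate check, as in the bug described above."""

    def __init__(self):
        self._free = []

    def return_decompressor(self, decompressor):
        self._free.append(decompressor)  # the same instance can be added twice

    def get_decompressor(self):
        return self._free.pop() if self._free else Decompressor()


def borrow_two_after_double_return():
    """Return one decompressor twice, then borrow twice."""
    pool = ListBackedPool()
    d = pool.get_decompressor()
    pool.return_decompressor(d)  # first close()
    pool.return_decompressor(d)  # second close() of the same reader
    return pool.get_decompressor(), pool.get_decompressor()


a, b = borrow_two_after_double_return()
a.buffer.extend(b"stream-1")
# b is the same object, so the second reader sees the first reader's bytes.
```

Two independent readers now share one buffer, which matches the "overwrite each other's buffers" symptom.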
      
      That Hadoop bug is fixed in Hadoop 2.7 but is still present in many Hadoop versions that we wish to support. As a result, I think that we should try to work around this issue in Spark via defensive programming to prevent RecordReaders from being closed multiple times.
      
      So far, I've had a hard time coming up with explanations of exactly how double-`close()`s occur in practice, but I do have a couple of explanations that work on paper.
      
      For instance, it looks like https://github.com/apache/spark/pull/7424, added in 1.5, introduces at least one extremely rare corner-case path where Spark could double-close() a LineRecordReader instance in a way that triggers the bug. Here are the steps involved in the bad execution that I brainstormed up:
      
      * [The task has finished reading input, so we call close()](https://github.com/apache/spark/blob/v1.5.1/core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala#L168).
      * [While handling the close call and trying to close the reader, reader.close() throws an exception](https://github.com/apache/spark/blob/v1.5.1/core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala#L190)
      * We don't set `reader = null` after handling this exception, so the [TaskCompletionListener also ends up calling NewHadoopRDD.close()](https://github.com/apache/spark/blob/v1.5.1/core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala#L156), which, in turn, closes the record reader again.
      
      In this hypothetical situation, `LineRecordReader.close()` could [fail with an exception if its InputStream failed to close](https://github.com/apache/hadoop/blob/release-1.2.1/src/mapred/org/apache/hadoop/mapred/LineRecordReader.java#L212).
      I googled for "Exception in RecordReader.close()" and it looks like it's possible for a closed Hadoop FileSystem to trigger an error there: [SPARK-757](https://issues.apache.org/jira/browse/SPARK-757), [SPARK-2491](https://issues.apache.org/jira/browse/SPARK-2491)
      
      Looking at [SPARK-3052](https://issues.apache.org/jira/browse/SPARK-3052), it seems like it's possible to get spurious exceptions there when there is an error reading from Hadoop. If the Hadoop FileSystem were to get into an error state _right_ after reading the last record then it looks like we could hit the bug here in 1.5.
      
      ## The fix
      
      This patch guards against these issues by modifying `HadoopRDD.close()` and `NewHadoopRDD.close()` so that they set `reader = null` even if an exception occurs in the `reader.close()` call. In addition, I modified `NextIterator.closeIfNeeded()` to guard against double-close if the first `close()` call throws an exception.
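
The guard can be sketched as follows. This is a Python model of the pattern, not the patch itself: `GuardedReaderHolder` and `FlakyReader` are hypothetical names, and the point is only that the reference is cleared in a `finally` block so a second `close()` becomes a no-op even when the first one throws.

```python
import logging


class GuardedReaderHolder:
    """Owns a reader and guarantees it is closed at most once."""

    def __init__(self, reader):
        self._reader = reader

    def close(self):
        if self._reader is not None:
            try:
                self._reader.close()
            except Exception as exc:
                logging.warning("Exception in RecordReader.close(): %s", exc)
            finally:
                # Cleared even on failure, so a later close() cannot reach
                # the underlying reader again.
                self._reader = None


class FlakyReader:
    """Test double whose close() always fails and counts its invocations."""

    def __init__(self):
        self.close_calls = 0

    def close(self):
        self.close_calls += 1
        raise IOError("stream already closed")


reader = FlakyReader()
holder = GuardedReaderHolder(reader)
holder.close()  # first close: the reader throws, the holder swallows and logs
holder.close()  # e.g. from a task-completion callback: now a no-op
```

Without the `finally` clause, the second call would invoke `reader.close()` again, which is the double-close path described in the steps above.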
      
      I don't have an easy way to test this, since I haven't been able to reproduce the bug that prompted this patch, but these changes seem safe and seem to rule out the on-paper reproductions that I was able to brainstorm up.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #9382 from JoshRosen/hadoop-decompressor-pooling-fix and squashes the following commits:
      
      5ec97d7 [Josh Rosen] Add SqlNewHadoopRDD.unsetInputFileName() that I accidentally deleted.
      ae46cf4 [Josh Rosen] Merge remote-tracking branch 'origin/master' into hadoop-decompressor-pooling-fix
      087aa63 [Josh Rosen] Guard against double-close() of RecordReaders.
    • [SPARK-11226][SQL] Empty line in json file should be skipped · 97b3c8fb
      Jeff Zhang authored
      Currently, an empty line in a JSON file is parsed into a Row with all null field values. But in JSON, "{}" represents a JSON object, whereas an empty line is not a record at all and should be skipped.
      
      Make a trivial change for this.
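
The intended behaviour can be shown with a small line-delimited JSON reader. `parse_json_lines` is a hypothetical helper written for this sketch, not the patched Spark code, and treating whitespace-only lines as empty is an assumption of the example.

```python
import json


def parse_json_lines(lines):
    """Parse one JSON value per line, skipping empty lines.

    An empty line produces no record, while "{}" still produces an
    (empty) object.
    """
    records = []
    for line in lines:
        if not line.strip():
            continue  # blank line: no record, rather than an all-null row
        records.append(json.loads(line))
    return records


records = parse_json_lines(['{"a": 1}', "", "   ", "{}"])
```

Here `records` holds two entries: the object with field `a` and the empty object from the `{}` line.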
      
      Author: Jeff Zhang <zjffdu@apache.org>
      
      Closes #9211 from zjffdu/SPARK-11226.
  4. Oct 30, 2015