  1. Nov 01, 2015
    • [SPARK-9298][SQL] Add Pearson correlation aggregation function · 3e770a64
      Liang-Chi Hsieh authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-9298
      
      This patch adds a Pearson correlation aggregation function based on `AggregateExpression2`.
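      
      As a usage sketch (data and names here are illustrative; it assumes the new aggregate is exposed in SQL as `corr`):
      
      ```scala
      // A minimal sketch, assuming the Spark shell's sqlContext and made-up data.
      val points = sqlContext.createDataFrame(Seq(
        (1.0, 2.1), (2.0, 3.9), (3.0, 6.0), (4.0, 8.2)
      )).toDF("x", "y")
      points.registerTempTable("points")
      
      // Pearson correlation of x and y; close to 1.0 for this nearly linear data.
      sqlContext.sql("SELECT corr(x, y) FROM points").show()
      ```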
      
      Author: Liang-Chi Hsieh <viirya@appier.com>
      
      Closes #8587 from viirya/corr_aggregation.
      3e770a64
    • [SPARK-11410][SQL] Add APIs to provide functionality similar to Hive's DISTRIBUTE BY and SORT BY. · 046e32ed
      Nong Li authored
      DISTRIBUTE BY allows the user to hash-partition the data by the specified exprs. It also allows for
      optional sorting within each resulting partition. There is no required relationship between the
      exprs for partitioning and sorting (i.e. one does not need to be a prefix of the other).
      
      This patch adds two APIs to DataFrames which can be used together to provide this functionality:
        1. distributeBy() which partitions the data frame into a specified number of partitions using the
           partitioning exprs.
        2. localSort() which sorts each partition using the provided sorting exprs.
      
      To get the DISTRIBUTE BY functionality, the user simply does: df.distributeBy(...).localSort(...)
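      
      A minimal sketch of the combined pattern (the exact argument shapes of the two new APIs are assumptions here, following the PR text rather than the final code):
      
      ```scala
      // Hash-partition by `key` into 4 partitions, then sort each partition by
      // `ts` -- the Hive DISTRIBUTE BY ... SORT BY pattern.
      val df = sqlContext.createDataFrame(Seq(
        ("a", 3L), ("b", 1L), ("a", 2L)
      )).toDF("key", "ts")
      
      val clustered = df.distributeBy(Seq(df("key")), 4).localSort(df("ts"))
      ```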
      
      Author: Nong Li <nongli@gmail.com>
      
      Closes #9364 from nongli/spark-11410.
      046e32ed
  2. Oct 30, 2015
    • [SPARK-11434][SPARK-11103][SQL] Fix test ": Filter applied on merged Parquet schema with new column fails" · 3c471885
      Yin Huai authored
      https://issues.apache.org/jira/browse/SPARK-11434
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #9387 from yhuai/SPARK-11434.
      3c471885
    • [SPARK-11423] remove MapPartitionsWithPreparationRDD · 45029bfd
      Davies Liu authored
      Since we do not need to preserve a page before calling compute(), MapPartitionsWithPreparationRDD is not needed anymore.
      
      This PR essentially reverts #8543, #8511, #8038 and #8011.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #9381 from davies/remove_prepare2.
      45029bfd
    • [SPARK-11393] [SQL] CoGroupedIterator should respect the fact that GroupedIterator.hasNext is not idempotent · 14d08b99
      Wenchen Fan authored
      When we cogroup 2 `GroupedIterator`s in `CoGroupedIterator`, if the right side is smaller, we consume the right-side data and keep the left side unchanged. Then we call `hasNext`, which calls `left.hasNext`. This makes `GroupedIterator` generate an extra group, as the previous one has not been consumed yet.
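      
      A toy illustration of the hazard (plain Scala, not Spark's code):
      
      ```scala
      // hasNext here is NOT idempotent: every call eagerly pulls the next group,
      // so calling it before consuming the current group silently drops data.
      class EagerGroups(data: Iterator[Int]) extends Iterator[Seq[Int]] {
        private var group: Seq[Int] = Seq.empty
        def hasNext: Boolean = {
          group = if (data.hasNext) Seq(data.next()) else Seq.empty
          group.nonEmpty
        }
        def next(): Seq[Int] = group
      }
      
      val it = new EagerGroups(Iterator(1, 2, 3))
      it.hasNext          // pulls Seq(1)
      it.hasNext          // pulls Seq(2); Seq(1) is lost without next()
      println(it.next())  // prints List(2)
      ```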
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #9346 from cloud-fan/cogroup and squashes the following commits:
      
      9be67c8 [Wenchen Fan] SPARK-11393
      14d08b99
    • [SPARK-11103][SQL] Filter applied on merged Parquet schema with new column fails · 59db9e9c
      hyukjinkwon authored
      When schema merging and a predicate filter are both enabled, this fails because Parquet does not accept pushed-down filters whose columns do not exist in the schema.
      This is related to a Parquet issue (https://issues.apache.org/jira/browse/PARQUET-389).
      
      For now, this PR simply disables predicate push-down when a merged schema is used.
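      
      A sketch of the failing pattern (paths and data are hypothetical):
      
      ```scala
      // File A has only `id`; file B adds `extra`. With schema merging enabled,
      // a filter on `extra` used to be pushed down to file A, which Parquet rejects.
      sqlContext.range(3).write.parquet("/tmp/t/part=a")
      sqlContext.range(3).selectExpr("id", "id AS extra").write.parquet("/tmp/t/part=b")
      
      val merged = sqlContext.read.option("mergeSchema", "true").parquet("/tmp/t")
      merged.filter("extra = 0").count()  // threw before; push-down is now skipped
      ```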
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #9327 from HyukjinKwon/SPARK-11103.
      59db9e9c
    • [SPARK-11417] [SQL] no @Override in codegen · eb59b94c
      Davies Liu authored
      Older versions of Janino (before 2.7) do not support `@Override`, so we should not use it in codegen.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #9372 from davies/no_override.
      eb59b94c
    • [SPARK-10342] [SPARK-10309] [SPARK-10474] [SPARK-10929] [SQL] Cooperative memory management · 56419cf1
      Davies Liu authored
      This PR introduces a mechanism to call spill() on those SQL operators that support spilling (for example, BytesToBytesMap, UnsafeExternalSorter and ShuffleExternalSorter) if there is not enough memory for execution. The preserved first page is no longer needed, so it has been removed.
      
      Other Spillable objects in Spark core (ExternalSorter and AppendOnlyMap) are not included in this PR, but they could benefit from this mechanism (by triggering spilling in other operators).
      
      The PrepareRDD may not be needed anymore, and could be removed in a follow-up PR.
      
      The following script fails with OOM before this PR, and finishes in 150 seconds with a 2G heap after it (it also works on the 1.5 branch, with similar duration).
      
      ```python
      sqlContext.setConf("spark.sql.shuffle.partitions", "1")
      # 2^25 rows of (id, repeated id): large enough to exhaust a 2G heap
      # without spilling.
      df = sqlContext.range(1 << 25).selectExpr("id", "repeat(id, 2) as s")
      df2 = df.select(df.id.alias('id2'), df.s.alias('s2'))
      # Self-join then aggregate: both the join and the aggregation must spill.
      j = df.join(df2, df.id == df2.id2).groupBy(df.id).max("id", "id2")
      j.explain()
      print(j.count())
      ```
      
      For thread-safety, here is the reasoning:
      
      1) Without calling spill(), each operator is only used by a single thread, so there are no safety problems.
      
      2) spill() can be triggered in two cases: by the operator itself, or by other operators. We can check `trigger == this` in spill(); in that case we are still on the same thread, so there are no safety problems.
      
      3) If it is triggered by other operators (right now the cache will not trigger spill()), we only spill data to disk during the scanning stage (after building has finished), so the in-memory sorter and memory pages are read-only; we only need to synchronize the iterator and swap it.
      
      4) During scanning, the iterator only uses one record in one page at a time. We can't free this page, because the downstream is currently using it (via UnsafeRow or other objects). In BytesToBytesMap, we just skip the current page and dump all the others to disk. In UnsafeExternalSorter, we keep the page used by the current record (the one with the same baseObject) and free it when loading the next record. In ShuffleExternalSorter, spill() will not be triggered during scanning.
      
      5) In order to avoid deadlock, we don't call acquireMemory during spill (so we reuse the pointer array in InMemorySorter).
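      
      A schematic of the `trigger == this` dispatch from (2) and (4); the class and method names below are illustrative, not Spark's exact API:
      
      ```scala
      abstract class SpillableOperator {
        /** `trigger` is the operator whose allocation request caused this spill. */
        def spill(size: Long, trigger: SpillableOperator): Long
      
        /** A self-triggered spill runs on the operator's own thread. */
        protected def selfTriggered(trigger: SpillableOperator): Boolean =
          trigger eq this
      }
      ```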
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #9241 from davies/force_spill.
      56419cf1
  3. Oct 27, 2015
    • [SPARK-10484] [SQL] Optimize the cartesian join with broadcast join for some cases · d9c60398
      Cheng Hao authored
      In some cases, we can broadcast the smaller relation in a cartesian join, which improves performance significantly.
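      
      A sketch of the targeted case, with made-up sizes (the broadcast decision itself depends on the usual size threshold):
      
      ```scala
      // A join with no equi-condition: the small side can be broadcast instead
      // of shuffling a full cartesian product.
      val big = sqlContext.range(1000000L).toDF("a")
      val small = sqlContext.range(10L).toDF("b")
      val joined = big.join(small, big("a") > small("b"))
      joined.explain()  // expect a broadcast-based plan, not CartesianProduct
      ```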
      
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #8652 from chenghao-intel/cartesian.
      d9c60398
    • [SPARK-11347] [SQL] Support for joinWith in Datasets · 5a5f6590
      Michael Armbrust authored
      This PR adds a new operation `joinWith` to a `Dataset`, which returns a `Tuple` for each pair where a given `condition` evaluates to true.
      
      ```scala
      case class ClassData(a: String, b: Int)
      
      val ds1 = Seq(ClassData("a", 1), ClassData("b", 2)).toDS()
      val ds2 = Seq(("a", 1), ("b", 2)).toDS()
      
      > ds1.joinWith(ds2, $"_1" === $"a").collect()
      res0: Array((ClassData("a", 1), ("a", 1)), (ClassData("b", 2), ("b", 2)))
      ```
      
      This operation is similar to the relational `join` function, with one important difference in the result schema. Since `joinWith` preserves objects present on either side of the join, the result schema is similarly nested into a tuple under the column names `_1` and `_2`.
      
      This type of join can be useful both for preserving type-safety with the original object types and for working with relational data where either side of the join has column names in common.
      
      ## Required Changes to Encoders
      In the process of working on this patch, several deficiencies in the way we were handling encoders were discovered.  Specifically, it turned out to be very difficult to `rebind` the non-expression-based encoders to extract the nested objects from the results of joins (and also typed selects that return tuples).
      
      As a result the following changes were made.
       - `ClassEncoder` has been renamed to `ExpressionEncoder` and has been improved to also handle primitive types.  Additionally, it is now possible to take arbitrary expression encoders and rewrite them into a single encoder that returns a tuple.
       - All internal operations on `Dataset`s now require an `ExpressionEncoder`.  If a user tries to pass a non-`ExpressionEncoder` in, an error will be thrown.  We can relax this requirement in the future by constructing a wrapper class that uses expressions to project the row to the expected schema, shielding the user's code from the required remapping.  This will give us a nice balance where we don't force user encoders to understand attribute references and binding, but still allow our native encoder to leverage runtime code generation to construct specific encoders for a given schema that avoid an extra remapping step.
       - Additionally, the semantics for different types of objects are now better defined.  As stated in the `ExpressionEncoder` scaladoc:
        - Classes will have their sub fields extracted by name using `UnresolvedAttribute` expressions
        and `UnresolvedExtractValue` expressions.
        - Tuples will have their subfields extracted by position using `BoundReference` expressions.
        - Primitives will have their values extracted from the first ordinal with a schema that defaults
        to the name `value`.
       - Finally, the binding lifecycle for `Encoders` has now been unified across the codebase.  Encoders are now `resolved` to the appropriate schema in the constructor of `Dataset`.  This process replaces unresolved expressions with concrete `AttributeReference` expressions.  Binding then happens on demand, when an encoder is going to be used to construct an object.  This closely mirrors the lifecycle for standard expressions when executing normal SQL or `DataFrame` queries (a toy sketch of this lifecycle follows the list).
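      
      A toy model of that lifecycle (plain Scala, not Spark's API):
      
      ```scala
      // Resolution replaces by-name references with concrete ones; binding then
      // replaces concrete references with ordinals, on demand.
      sealed trait Ref
      case class ByName(name: String) extends Ref            // UnresolvedAttribute
      case class Resolved(name: String, id: Int) extends Ref // AttributeReference
      case class Bound(ordinal: Int) extends Ref             // BoundReference
      
      def resolve(r: Ref, schema: Seq[String]): Ref = r match {
        case ByName(n) => Resolved(n, schema.indexOf(n))
        case other     => other
      }
      
      def bind(r: Ref): Ref = r match {
        case Resolved(_, id) => Bound(id)
        case other           => other
      }
      
      bind(resolve(ByName("value"), Seq("key", "value")))  // Bound(1)
      ```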
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #9300 from marmbrus/datasets-tuples.
      5a5f6590
    • [SPARK-11303][SQL] filter should not be pushed down into sample · 360ed832
      Yanbo Liang authored
      When sampling and then filtering a DataFrame, the SQL optimizer will push the filter down into the sample and produce wrong results. This is because the sampler is computed based on the original data rather than the data remaining after filtering.
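      
      A sketch of the symptom with made-up data:
      
      ```scala
      // The filter must run on the sampled output; pushing it beneath the
      // sample changes what the sampler sees and skews the result.
      val df = sqlContext.range(1000L).toDF("id")
      val sampled = df.sample(withReplacement = false, fraction = 0.1, seed = 7L)
      val filtered = sampled.filter("id % 2 = 0")
      println(filtered.count())  // was wrong before the fix
      ```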
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #9294 from yanboliang/spark-11303.
      360ed832
    • [SPARK-11277][SQL] sort_array throws exception scala.MatchError · 958a0ec8
      Jia Li authored
      I'm new to Spark. I was trying out the sort_array function and hit this exception. I looked into the Spark source code and found that the root cause is that sort_array does not check for an array of NULLs. It's not meaningful to sort an array consisting entirely of NULLs anyway.
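      
      A sketch of the kind of input that triggered it (illustrative SQL):
      
      ```scala
      // An array consisting entirely of NULLs previously hit scala.MatchError
      // inside SortArray.
      sqlContext.sql("SELECT sort_array(array(null, null))").show()
      ```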
      
      I'm adding a check on the input array type to SortArray. If the array consists entirely of NULLs, there is no need to sort it. I have also added a test case for this.
      
      Please help to review my fix. Thanks!
      
      Author: Jia Li <jiali@us.ibm.com>
      
      Closes #9247 from jliwork/SPARK-11277.
      958a0ec8
  4. Oct 25, 2015
    • [SPARK-10984] Simplify *MemoryManager class structure · 85e654c5
      Josh Rosen authored
      This patch refactors the MemoryManager class structure. After #9000, Spark had the following classes:
      
      - MemoryManager
      - StaticMemoryManager
      - ExecutorMemoryManager
      - TaskMemoryManager
      - ShuffleMemoryManager
      
      This is fairly confusing. To simplify things, this patch consolidates several of these classes:
      
      - ShuffleMemoryManager and ExecutorMemoryManager were merged into MemoryManager.
      - TaskMemoryManager is moved into Spark Core.
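      
      A sketch of the consolidated shape (the signatures below are illustrative assumptions, not Spark's actual API):
      
      ```scala
      abstract class MemoryManager {
        // Execution-side accounting absorbed from ShuffleMemoryManager.
        def acquireExecutionMemory(numBytes: Long, taskAttemptId: Long): Long
        def releaseExecutionMemory(numBytes: Long, taskAttemptId: Long): Unit
      }
      
      // A fixed execution pool, in the spirit of StaticMemoryManager.
      class FixedPoolMemoryManager(maxExecution: Long) extends MemoryManager {
        private var used = 0L
        def acquireExecutionMemory(numBytes: Long, taskAttemptId: Long): Long = {
          val granted = math.min(numBytes, maxExecution - used)
          used += granted
          granted
        }
        def releaseExecutionMemory(numBytes: Long, taskAttemptId: Long): Unit = {
          used = math.max(0L, used - numBytes)
        }
      }
      ```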
      
      **Key changes and tasks**:
      
      - [x] Merge ExecutorMemoryManager into MemoryManager.
        - [x] Move pooling logic into Allocator.
      - [x] Move TaskMemoryManager from `spark-unsafe` to `spark-core`.
      - [x] Refactor the existing Tungsten TaskMemoryManager interactions so Tungsten code uses only TaskMemoryManager, not both it and ShuffleMemoryManager.
      - [x] Refactor non-Tungsten code to use the TaskMemoryManager instead of ShuffleMemoryManager.
      - [x] Merge ShuffleMemoryManager into MemoryManager.
        - [x] Move code
        - [x] ~~Simplify 1/n calculation.~~ **Will defer to followup, since this needs more work.**
      - [x] Port ShuffleMemoryManagerSuite tests.
      - [x] Move classes from `unsafe` package to `memory` package.
      - [ ] Figure out how to handle the hacky use of the memory managers in HashedRelation's broadcast variable construction.
      - [x] Test porting and cleanup: several tests relied on mock functionality (such as `TestShuffleMemoryManager.markAsOutOfMemory`) which has been changed or broken during the memory manager consolidation
        - [x] AbstractBytesToBytesMapSuite
        - [x] UnsafeExternalSorterSuite
        - [x] UnsafeFixedWidthAggregationMapSuite
        - [x] UnsafeKVExternalSorterSuite
      
      **Compatibility notes**:
      
      - This patch introduces breaking changes in `ExternalAppendOnlyMap`, which is marked as `DeveloperApi` (likely for legacy reasons): this class can no longer be used outside of a task.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #9127 from JoshRosen/SPARK-10984.
      85e654c5
    • [SPARK-6428][SQL] Removed unnecessary typecasts in MutableInt, MutableDouble etc. · 92b9c5ed
      Alexander Slesarenko authored
      marmbrus rxin I believe these typecasts are not required in the presence of explicit return types.
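      
      A toy illustration of the pattern being removed (not the actual MutableInt code):
      
      ```scala
      final class MutableIntLike(var value: Int) {
        // Before: new MutableIntLike(value).asInstanceOf[MutableIntLike]
        // With the explicit return type, the cast is redundant:
        def copy(): MutableIntLike = new MutableIntLike(value)
      }
      ```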
      
      Author: Alexander Slesarenko <avslesarenko@gmail.com>
      
      Closes #9262 from aslesarenko/remove-typecasts.
      92b9c5ed