  1. Sep 12, 2015
      [SPARK-6548] Adding stddev to DataFrame functions · f4a22808
      JihongMa authored
      Adds STDDEV support for DataFrame using a one-pass online/parallel algorithm to compute variance. Please review the code change.
      
      Author: JihongMa <linlin200605@gmail.com>
      Author: Jihong MA <linlin200605@gmail.com>
      Author: Jihong MA <jihongma@jihongs-mbp.usca.ibm.com>
      Author: Jihong MA <jihongma@Jihongs-MacBook-Pro.local>
      
      Closes #6297 from JihongMA/SPARK-SQL.
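
A common one-pass online variance algorithm is Welford's, with a merge step (Chan et al.) for combining partial results in parallel. The sketch below illustrates the technique the commit describes; the commit's actual Scala implementation may differ.

```python
# Welford's one-pass variance with a parallel merge step (illustrative sketch).

def welford(xs):
    """Return (count, mean, M2) after a single pass over xs."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in xs:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return n, mean, m2

def merge(a, b):
    """Combine two partial (count, mean, M2) states (Chan et al.)."""
    n1, mean1, m2_1 = a
    n2, mean2, m2_2 = b
    n = n1 + n2
    if n == 0:
        return 0, 0.0, 0.0
    delta = mean2 - mean1
    mean = mean1 + delta * n2 / n
    m2 = m2_1 + m2_2 + delta * delta * n1 * n2 / n
    return n, mean, m2

def stddev(state, sample=True):
    n, _, m2 = state
    return (m2 / (n - 1 if sample else n)) ** 0.5
```

Because `merge` is associative, partitions can be reduced in any order, which is what makes the algorithm suitable for DataFrame aggregation.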
      [SPARK-10547] [TEST] Streamline / improve style of Java API tests · 22730ad5
      Sean Owen authored
      Fix a few Java API test style issues: unused generic types, exceptions, wrong assert argument order
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #8706 from srowen/SPARK-10547.
      [SPARK-10554] [CORE] Fix NPE with ShutdownHook · 8285e3b0
      Nithin Asokan authored
      https://issues.apache.org/jira/browse/SPARK-10554
      
      Fixes NPE when ShutdownHook tries to cleanup temporary folders
      
      Author: Nithin Asokan <Nithin.Asokan@Cerner.com>
      
      Closes #8720 from nasokan/SPARK-10554.
      [SPARK-10566] [CORE] SnappyCompressionCodec init exception handling masks... · 6d836780
      Daniel Imfeld authored
      [SPARK-10566] [CORE] SnappyCompressionCodec init exception handling masks important error information
      
      When throwing an IllegalArgumentException in SnappyCompressionCodec.init, chain the existing exception. This allows potentially important debugging info to be passed to the user.
      
      Manual testing shows the exception chained properly, and the test suite still looks fine as well.
      
      This contribution is my original work and I license the work to the project under the project's open source license.
      
      Author: Daniel Imfeld <daniel@danielimfeld.com>
      
      Closes #8725 from dimfeld/dimfeld-patch-1.
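
Exception chaining preserves the root cause when rethrowing. The actual fix is in Scala (passing the cause to the `IllegalArgumentException` constructor); a minimal Python analogue of the same pattern, using a hypothetical module name, is:

```python
# Analogous chaining pattern in Python (sketch, not the actual Spark code).
# `snappy_native_stub` is a hypothetical module that may fail to import.

def load_codec(conf):
    try:
        import snappy_native_stub  # hypothetical native dependency
    except ImportError as e:
        # Chain the original exception instead of discarding it, so the
        # user sees the underlying cause in the traceback.
        raise ValueError("snappy codec failed to initialize") from e
```

With chaining, the traceback shows both the wrapper and the original error, which is exactly the debugging information the commit aims to preserve.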
  2. Sep 11, 2015
  3. Sep 10, 2015
      [SPARK-10027] [ML] [PySpark] Add Python API missing methods for ml.feature · a140dd77
      Yanbo Liang authored
      Missing methods of ml.feature are listed here:
      ```StringIndexer``` lacks the parameter ```handleInvalid```.
      ```StringIndexerModel``` lacks the method ```labels```.
      ```VectorIndexerModel``` lacks the methods ```numFeatures``` and ```categoryMaps```.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #8313 from yanboliang/spark-10027.
      [SPARK-10023] [ML] [PySpark] Unified DecisionTreeParams checkpointInterval... · 339a5271
      Yanbo Liang authored
      [SPARK-10023] [ML] [PySpark] Unified DecisionTreeParams checkpointInterval between Scala and Python API.
      
      "checkpointInterval" is a member of DecisionTreeParams in the Scala API, which is inconsistent with the Python API; we should unify them.
      ```
      member of DecisionTreeParams <-> Scala API
      shared param for all ML Transformer/Estimator <-> Python API
      ```
      Proposal:
      "checkpointInterval" is also used by ALS, so we make it a shared param on the Scala side.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #8528 from yanboliang/spark-10023.
      [SPARK-9043] Serialize key, value and combiner classes in ShuffleDependency · 0eabea8a
      Matt Massie authored
      ShuffleManager implementations are currently not given type information for
      the key, value and combiner classes. Serialization of shuffle objects relies
      on objects being JavaSerializable, with methods defined for reading/writing
      the object or, alternatively, serialization via Kryo which uses reflection.
      
      Serialization systems like Avro, Thrift and Protobuf generate classes with
      zero argument constructors and explicit schema information
      (e.g. IndexedRecords in Avro have get, put and getSchema methods).
      
      By serializing the key, value and combiner class names in ShuffleDependency,
      shuffle implementations will have access to schema information when
      registerShuffle() is called.
      
      Author: Matt Massie <massie@cs.berkeley.edu>
      
      Closes #7403 from massie/shuffle-classtags.
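
The idea can be sketched as: carry fully qualified class names in the dependency, and resolve them when the shuffle is registered. The names and shapes below are illustrative, not Spark's API.

```python
# Illustrative sketch: a dependency that records key/value/combiner class
# names, plus a resolver a serializer might use to recover the classes.
from dataclasses import dataclass
from typing import Optional
import importlib

@dataclass
class ShuffleDep:
    key_class_name: str
    value_class_name: str
    combiner_class_name: Optional[str] = None

def resolve(class_name):
    """Load a class from its fully qualified name."""
    module, _, name = class_name.rpartition(".")
    return getattr(importlib.import_module(module), name)

dep = ShuffleDep("builtins.int", "decimal.Decimal")
assert resolve(dep.key_class_name) is int
```

A schema-aware serializer (Avro, Thrift, Protobuf) could then instantiate readers/writers from these classes at `registerShuffle()` time instead of relying on reflection per object.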
      [SPARK-7544] [SQL] [PySpark] pyspark.sql.types.Row implements __getitem__ · 89562a17
      Yanbo Liang authored
      pyspark.sql.types.Row implements ```__getitem__```
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #8333 from yanboliang/spark-7544.
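
A minimal sketch of what `__getitem__` enables on a Row-like type: both positional and field-name access. This is illustrative only, not PySpark's actual implementation.

```python
# Row-like tuple supporting row[0] and row["name"] via __getitem__ (sketch).
class Row(tuple):
    def __new__(cls, **kwargs):
        row = super().__new__(cls, kwargs.values())
        row._fields = list(kwargs)  # field order matches value order
        return row

    def __getitem__(self, item):
        if isinstance(item, str):                       # row["name"]
            return tuple.__getitem__(self, self._fields.index(item))
        return tuple.__getitem__(self, item)            # row[0], row[1:]

r = Row(name="Alice", age=11)
assert r[0] == "Alice" and r["age"] == 11
```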
      Add 1.5 to master branch EC2 scripts · 42047577
      Shivaram Venkataraman authored
      This change brings it on par with `branch-1.5` (and the 1.5.0 release)
      
      Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
      
      Closes #8704 from shivaram/ec2-1.5-update.
      [SPARK-10443] [SQL] Refactor SortMergeOuterJoin to reduce duplication · 3db72554
      Andrew Or authored
      `LeftOutputIterator` and `RightOutputIterator` are symmetrically identical and can share a lot of code. If someone makes a change in one but forgets to do the same thing in the other we'll end up with inconsistent behavior. This patch also adds inline comments to clarify the intention of the code.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #8596 from andrewor14/smoj-cleanup.
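
The deduplication idea is to replace two mirror-image iterator classes with one parameterized by side, so the padding logic exists once. A rough Python sketch of the principle (illustrative only; the real code streams sorted SQL rows):

```python
# One parameterized outer-join iterator instead of symmetric Left/Right
# copies (sketch). Misses on the lookup side are padded with null_row.
def one_side_outer(stream_side_rows, lookup, null_row, side="left"):
    """Yield joined pairs, padding lookup misses with null_row."""
    for key, row in stream_side_rows:
        other = lookup.get(key)
        if other is None:
            other = null_row
        # The only asymmetry between left and right outer join is the
        # output order, so it is a parameter rather than a second class.
        yield (row, other) if side == "left" else (other, row)

rows = list(one_side_outer([(1, "a"), (2, "b")], {1: "x"}, None))
assert rows == [("a", "x"), ("b", None)]
```

Any fix to the shared logic now applies to both sides automatically, which is the inconsistency risk the patch removes.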
      [SPARK-10049] [SPARKR] Support collecting data of ArrayType in DataFrame. · 45e3be5c
      Sun Rui authored
      This PR:
      1.  Enhances reflection in RBackend, automatically matching a Java array to a Scala Seq when finding methods. Util functions like seq() and listToSeq() on the R side can be removed, as they would conflict with the SerDe logic that transfers a Scala Seq to the R side.
      
      2.  Enhances the SerDe to support transferring a Scala Seq to the R side. Data of ArrayType in a DataFrame is observed to be of Scala Seq type after collection.
      
      3.  Supports ArrayType in createDataFrame().
      
      Author: Sun Rui <rui.sun@intel.com>
      
      Closes #8458 from sun-rui/SPARK-10049.
      [SPARK-9990] [SQL] Create local hash join operator · d88abb7e
      zsxwing authored
      This PR includes the following changes:
      - Add SQLConf to LocalNode
      - Add HashJoinNode
      - Add ConvertToUnsafeNode and ConvertToSafeNode.scala to test unsafe hash join.
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #8535 from zsxwing/SPARK-9990.
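
For context, the core of a hash join: build a hash table on one side, then probe it with the other. A minimal in-memory sketch (the real HashJoinNode operates on SQL rows under a SQLConf):

```python
# Minimal inner hash join over (key, value) pairs (illustrative sketch).
from collections import defaultdict

def hash_join(build, probe):
    """Inner-join two iterables of (key, value) pairs."""
    table = defaultdict(list)
    for k, v in build:          # build phase: hash one side
        table[k].append(v)
    out = []
    for k, v in probe:          # probe phase: stream the other side
        for build_v in table.get(k, ()):
            out.append((k, build_v, v))
    return out
```

The unsafe/safe conversion nodes mentioned above exist so the same join logic can be exercised against both row formats in tests.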
      [SPARK-10514] [MESOS] waiting for min no of total cores acquired by Spark by... · a5ef2d06
      Akash Mishra authored
      [SPARK-10514] [MESOS] waiting for min no of total cores acquired by Spark by implementing the sufficientResourcesRegistered method
      
      The spark.scheduler.minRegisteredResourcesRatio configuration parameter works for YARN mode but not for Mesos coarse-grained mode.
      
      If the parameter is not handled, the default value of 0 will be set for spark.scheduler.minRegisteredResourcesRatio in the base class and this method will always return true.
      
      There are no existing tests for YARN mode either, hence no test was added here.
      
      Author: Akash Mishra <akash.mishra20@gmail.com>
      
      Closes #8672 from SleepyThread/master.
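
The semantics being implemented reduce to a simple predicate over registered versus requested cores. A sketch of the assumed shape (not Spark's actual method):

```python
# Sketch of sufficientResourcesRegistered semantics for coarse-grained mode:
# the scheduler should wait until registered cores reach the configured
# ratio of the requested total (names are illustrative).
def sufficient_resources_registered(total_registered_cores,
                                    max_cores,
                                    min_registered_ratio):
    """Mirror the spark.scheduler.minRegisteredResourcesRatio check."""
    return total_registered_cores >= max_cores * min_registered_ratio

assert sufficient_resources_registered(8, 16, 0.5)
assert not sufficient_resources_registered(7, 16, 0.5)
```

With the base-class default ratio of 0, the predicate is trivially true, which is the behavior the patch fixes for Mesos.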
      [SPARK-6350] [MESOS] Fine-grained mode scheduler respects mesosExecutor.cores · f0562e8c
      Iulian Dragos authored
      This is a regression introduced in #4960, this commit fixes it and adds a test.
      
      tnachen andrewor14 please review, this should be an easy one.
      
      Author: Iulian Dragos <jaguarul@gmail.com>
      
      Closes #8653 from dragos/issue/mesos/fine-grained-maxExecutorCores.
      [SPARK-8167] Make tasks that fail from YARN preemption not fail job · af3bc59d
      mcheah authored
      The architecture is that, in YARN mode, if the driver detects that an executor has disconnected, it asks the ApplicationMaster why the executor died. If the ApplicationMaster is aware that the executor died because of preemption, all tasks associated with that executor are not marked as failed. The executor
      is still removed from the driver's list of available executors, however.
      
      There are a few open questions:
      1. Should standalone mode have a similar "get executor loss reason" as well? I localized this change as much as possible to affect only YARN, but there could be a valid case to differentiate executor losses in standalone mode as well.
      2. I make a pretty strong assumption in YarnAllocator that getExecutorLossReason(executorId) will only be called once per executor id; I do this so that I can remove the metadata from the in-memory map to avoid object accumulation. It's not clear if I'm being overly zealous to save space, however.
      
      cc vanzin specifically for review because it collided with some earlier YARN scheduling work.
      cc JoshRosen because it's similar to output commit coordination we did in the past
      cc andrewor14 for our discussion on how to get executor exit codes and loss reasons
      
      Author: mcheah <mcheah@palantir.com>
      
      Closes #8007 from mccheah/feature/preemption-handling.
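
The logic described above, including the once-per-executor assumption from open question 2, can be sketched as follows. Class and reason names here are illustrative, not Spark's actual API.

```python
# Sketch: record why each executor was lost; only count its tasks as failed
# when the loss was not preemption. The reason is popped on read because it
# is assumed to be requested once per executor id (avoids accumulation).
class LossReasonTracker:
    def __init__(self):
        self._reasons = {}

    def record(self, executor_id, reason):
        self._reasons[executor_id] = reason

    def get_loss_reason(self, executor_id):
        # Pop so the in-memory map does not grow with dead executors.
        return self._reasons.pop(executor_id, "unknown")

def count_tasks_as_failed(reason):
    """Preempted executors should not make their tasks count as failed."""
    return reason != "preempted"

tracker = LossReasonTracker()
tracker.record("exec-1", "preempted")
assert not count_tasks_as_failed(tracker.get_loss_reason("exec-1"))
assert tracker.get_loss_reason("exec-1") == "unknown"  # entry was removed
```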
      [SPARK-10469] [DOC] Try and document the three options · a76bde9d
      Holden Karau authored
      From JIRA:
      Add documentation for tungsten-sort.
      From the mailing list: "I saw a new "spark.shuffle.manager=tungsten-sort" implemented in
      https://issues.apache.org/jira/browse/SPARK-7081, but its corresponding description can't be found in
      http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc3-docs/configuration.html (currently
      there are only the 'sort' and 'hash' options)."
      
      Author: Holden Karau <holden@pigscanfly.ca>
      
      Closes #8638 from holdenk/SPARK-10469-document-tungsten-sort.
      [SPARK-10466] [SQL] UnsafeRow SerDe exception with data spill · e0481113
      Cheng Hao authored
      Data Spill with UnsafeRow causes assert failure.
      
      ```
      java.lang.AssertionError: assertion failed
      	at scala.Predef$.assert(Predef.scala:165)
      	at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2.writeKey(UnsafeRowSerializer.scala:75)
      	at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:180)
      	at org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:688)
      	at org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:687)
      	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
      	at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
      	at org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:687)
      	at org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:683)
      	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
      	at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
      	at org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:683)
      	at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:80)
      	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
      	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
      	at org.apache.spark.scheduler.Task.run(Task.scala:88)
      	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
      ```
      
      To reproduce that with code (thanks andrewor14):
      ```scala
      bin/spark-shell --master local \
        --conf spark.shuffle.memoryFraction=0.005 \
        --conf spark.shuffle.sort.bypassMergeThreshold=0
      
      sc.parallelize(1 to 2 * 1000 * 1000, 10)
        .map { i => (i, i) }.toDF("a", "b").groupBy("b").avg().count()
      ```
      
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #8635 from chenghao-intel/unsafe_spill.
      [SPARK-10301] [SPARK-10428] [SQL] Addresses comments of PR #8583 and #8509 for master · 49da38e5
      Cheng Lian authored
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #8670 from liancheng/spark-10301/address-pr-comments.
      [SPARK-7142] [SQL] Minor enhancement to BooleanSimplification Optimizer rule · f892d927
      Yash Datta authored
      Use these in the optimizer as well:
      
      ```
      A and (not(A) or B)  =>  A and B
      not(A and B)         =>  not(A) or not(B)
      not(A or B)          =>  not(A) and not(B)
      ```
      
      Author: Yash Datta <Yash.Datta@guavus.com>
      
      Closes #5700 from saucam/bool_simp.
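
Each of the rewrite rules can be verified exhaustively over all truth assignments, a quick sanity check (not the optimizer's Scala code):

```python
# Truth-table check of the three BooleanSimplification rewrite rules.
from itertools import product

for a, b in product([False, True], repeat=2):
    assert (a and (not a or b)) == (a and b)            # absorption variant
    assert (not (a and b)) == ((not a) or (not b))      # De Morgan
    assert (not (a or b)) == ((not a) and (not b))      # De Morgan
```

The last two are De Morgan's laws; the first follows from distributing `and` over `or` and dropping the contradiction `A and not(A)`.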
      [SPARK-10065] [SQL] avoid the extra copy when generate unsafe array · 4f1daa1e
      Wenchen Fan authored
      The reason for this extra copy is that we iterate the array twice: calculate elements data size and copy elements to array buffer.
      
      A simple solution is to follow `createCodeForStruct`, we can dynamically grow the buffer when needed and thus don't need to know the data size ahead.
      
      This PR also includes some typo and style fixes, and some minor refactoring to make sure `input.primitive` is always a variable name, not code, when generating unsafe code.
      
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      Closes #8496 from cloud-fan/avoid-copy.
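
The "grow the buffer on demand" approach can be sketched as follows: instead of a first pass to compute the exact output size, append into a buffer that doubles when full. Illustrative only; Spark's codegen writes raw bytes, not Python objects.

```python
# Growable byte buffer: doubles capacity when a write would overflow,
# so a single pass suffices and no pre-sizing pass is needed (sketch).
class GrowableBuffer:
    def __init__(self, capacity=16):
        self._buf = bytearray(capacity)
        self._size = 0

    def write(self, data: bytes):
        while self._size + len(data) > len(self._buf):
            self._buf.extend(bytearray(len(self._buf)))  # double capacity
        self._buf[self._size:self._size + len(data)] = data
        self._size += len(data)

    def result(self) -> bytes:
        return bytes(self._buf[:self._size])

buf = GrowableBuffer(capacity=4)
for chunk in (b"one", b"two", b"three"):   # single pass over the elements
    buf.write(chunk)
assert buf.result() == b"onetwothree"
```

Doubling keeps the amortized cost of writes O(1), which is why it can replace the sizing pass without a slowdown.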
      [SPARK-10497] [BUILD] [TRIVIAL] Handle both locations for JIRAError with python-jira · 48817cc1
      Holden Karau authored
      The location of JIRAError has moved between old and new versions of the python-jira package.
      Longer term it probably makes sense to pin to specific versions (as mentioned in https://issues.apache.org/jira/browse/SPARK-10498), but for now this makes the release tools work with both new and old versions of python-jira.
      
      Author: Holden Karau <holden@pigscanfly.ca>
      
      Closes #8661 from holdenk/SPARK-10497-release-utils-does-not-work-with-new-jira-python.
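
The standard pattern for a class that moved between package versions is a try/except import fallback. The sketch below generalizes it with importlib so it is self-contained and runnable; the jira module paths in the comment are only examples of how it would be applied, not verified locations.

```python
# Import-location fallback: return the first module that provides `name`.
import importlib

def import_from_first(name, *module_paths):
    """Try each module path in order; return the named attribute."""
    for path in module_paths:
        try:
            module = importlib.import_module(path)
            return getattr(module, name)
        except (ImportError, AttributeError):
            continue
    raise ImportError(f"{name} not found in any of {module_paths}")

# Applied to python-jira it would look like (paths are illustrative):
#   JIRAError = import_from_first("JIRAError", "new.location", "old.location")
Mapping = import_from_first("Mapping", "nonexistent.module", "collections.abc")
```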
      [MINOR] [MLLIB] [ML] [DOC] fixed typo: label for negative result should be 0.0 (original: 1.0) · 1dc7548c
      Sean Paradiso authored
      Small typo in the example for `LabeledPoint` in the MLlib docs.
      
      Author: Sean Paradiso <seanparadiso@gmail.com>
      
      Closes #8680 from sparadiso/docs_mllib_smalltypo.
  4. Sep 09, 2015