  1. Oct 26, 2015
  2. Oct 25, 2015
    • Xiangrui Meng's avatar
      [SPARK-11127][STREAMING] upgrade AWS SDK and Kinesis Client Library (KCL) · 87f82a5f
      Xiangrui Meng authored
      AWS SDK 1.9.40 is the latest 1.9.x release. KCL 1.5.1 is the latest release that uses AWS SDK 1.9.x. The main goal is to enable the Kinesis consumer to read messages generated by the Kinesis Producer Library (KPL). The API should remain compatible with old versions.
      
      tdas brkyvz
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #9153 from mengxr/SPARK-11127.
      87f82a5f
    • Josh Rosen's avatar
      [SPARK-10984] Simplify *MemoryManager class structure · 85e654c5
      Josh Rosen authored
      This patch refactors the MemoryManager class structure. After #9000, Spark had the following classes:
      
      - MemoryManager
      - StaticMemoryManager
      - ExecutorMemoryManager
      - TaskMemoryManager
      - ShuffleMemoryManager
      
      This is fairly confusing. To simplify things, this patch consolidates several of these classes:
      
      - ShuffleMemoryManager and ExecutorMemoryManager were merged into MemoryManager.
      - TaskMemoryManager is moved into Spark Core.
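      
      The consolidation described above can be sketched roughly as follows. This is a hedged illustration only: the class names come from the patch, but the method bodies are simplified placeholders invented for this example, not Spark's implementation.
      
      ```scala
      // Hedged sketch of the consolidated hierarchy; bodies are placeholders.
      abstract class MemoryManager {
        def acquireExecutionMemory(numBytes: Long): Long
        def releaseExecutionMemory(numBytes: Long): Unit
      }
      
      class StaticMemoryManager(maxExecutionBytes: Long) extends MemoryManager {
        private var used = 0L
      
        // Grant as much of the request as the fixed pool allows.
        override def acquireExecutionMemory(numBytes: Long): Long = synchronized {
          val granted = math.min(numBytes, maxExecutionBytes - used)
          used += granted
          granted
        }
      
        override def releaseExecutionMemory(numBytes: Long): Unit = synchronized {
          used = math.max(0L, used - numBytes)
        }
      }
      ```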
      
      **Key changes and tasks**:
      
      - [x] Merge ExecutorMemoryManager into MemoryManager.
        - [x] Move pooling logic into Allocator.
      - [x] Move TaskMemoryManager from `spark-unsafe` to `spark-core`.
      - [x] Refactor the existing Tungsten TaskMemoryManager interactions so that Tungsten code uses only this, and not both this and ShuffleMemoryManager.
      - [x] Refactor non-Tungsten code to use the TaskMemoryManager instead of ShuffleMemoryManager.
      - [x] Merge ShuffleMemoryManager into MemoryManager.
        - [x] Move code
        - [x] ~~Simplify 1/n calculation.~~ **Will defer to followup, since this needs more work.**
      - [x] Port ShuffleMemoryManagerSuite tests.
      - [x] Move classes from `unsafe` package to `memory` package.
      - [ ] Figure out how to handle the hacky use of the memory managers in HashedRelation's broadcast variable construction.
      - [x] Test porting and cleanup: several tests relied on mock functionality (such as `TestShuffleMemoryManager.markAsOutOfMemory`) which has been changed or broken during the memory manager consolidation
        - [x] AbstractBytesToBytesMapSuite
        - [x] UnsafeExternalSorterSuite
        - [x] UnsafeFixedWidthAggregationMapSuite
        - [x] UnsafeKVExternalSorterSuite
      
      **Compatibility notes**:
      
      - This patch introduces breaking changes in `ExternalAppendOnlyMap`, which is marked as `DeveloperApi` (likely for legacy reasons): this class can no longer be used outside of a task.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #9127 from JoshRosen/SPARK-10984.
      85e654c5
    • Burak Yavuz's avatar
      [SPARK-10891][STREAMING][KINESIS] Add MessageHandler to... · 63accc79
      Burak Yavuz authored
      [SPARK-10891][STREAMING][KINESIS] Add MessageHandler to KinesisUtils.createStream similar to Direct Kafka
      
      This PR allows users to map a Kinesis `Record` to a generic `T` when creating a Kinesis stream. This is particularly useful if you would like to do extra work with Kinesis metadata such as the sequence number and partition key.
      
      TODO:
       - [x] add tests
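      
      A hedged sketch of what such a handler does. `FakeRecord` and `Enriched` are stand-in types invented for this example so it is self-contained; the real API operates on `com.amazonaws.services.kinesis.model.Record`.
      
      ```scala
      // Stub standing in for com.amazonaws.services.kinesis.model.Record.
      case class FakeRecord(partitionKey: String, sequenceNumber: String, data: Array[Byte])
      
      // A user-defined target type T that keeps Kinesis metadata alongside the payload.
      case class Enriched(key: String, seqNo: String, payload: String)
      
      // The kind of Record => T function this PR lets users supply at stream creation.
      val messageHandler: FakeRecord => Enriched = r =>
        Enriched(r.partitionKey, r.sequenceNumber, new String(r.data, "UTF-8"))
      ```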
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #8954 from brkyvz/kinesis-handler.
      63accc79
    • Bryan Cutler's avatar
      [SPARK-11287] Fixed class name to properly start TestExecutor from deploy.client.TestClient · 80279ac1
      Bryan Cutler authored
      Executing deploy.client.TestClient fails due to bad class name for TestExecutor in ApplicationDescription.
      
      Author: Bryan Cutler <bjcutler@us.ibm.com>
      
      Closes #9255 from BryanCutler/fix-TestClient-classname-SPARK-11287.
      80279ac1
    • Alexander Slesarenko's avatar
      [SPARK-6428][SQL] Removed unnecessary typecasts in MutableInt, MutableDouble etc. · 92b9c5ed
      Alexander Slesarenko authored
      marmbrus rxin I believe these typecasts are not required in the presence of explicit return types.
      
      Author: Alexander Slesarenko <avslesarenko@gmail.com>
      
      Closes #9262 from aslesarenko/remove-typecasts.
      92b9c5ed
    • Josh Rosen's avatar
      [SPARK-11299][DOC] Fix link to Scala DataFrame Functions reference · b67dc6a4
      Josh Rosen authored
      The SQL programming guide's link to the DataFrame functions reference points to the wrong location; this patch fixes that.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #9269 from JoshRosen/SPARK-11299.
      b67dc6a4
  3. Oct 24, 2015
    • Jacek Laskowski's avatar
      Fix typos · 146da0d8
      Jacek Laskowski authored
      Two typos squashed.
      
      BTW, let me know how to proceed with other typos if I run across any. I don't feel good leaving them aside, but I'm also hesitant to send pull requests with such tiny changes. Guide me.
      
      Author: Jacek Laskowski <jacek.laskowski@deepsense.io>
      
      Closes #9250 from jaceklaskowski/typos-hunting.
      146da0d8
    • Jeffrey Naisbitt's avatar
      [SPARK-11264] bin/spark-class can't find assembly jars with certain GREP_OPTIONS set · 28132ceb
      Jeffrey Naisbitt authored
      Temporarily remove GREP_OPTIONS if set in bin/spark-class.
      
      Some GREP_OPTIONS will modify the output of the grep commands that are looking for the assembly jars.
      For example, if the -n option is specified, the grep output will look like:
      5:spark-assembly-1.5.1-hadoop2.4.0.jar
      
      This will not match the regular expressions, and so the jar files will not be found.  We could improve the regular expression to handle this case and trim off extra characters, but it is difficult to know which options may or may not be set.  Unsetting GREP_OPTIONS within the script handles all the cases and gives the desired output.
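      
      A minimal sketch of the approach (illustrative, not the exact `bin/spark-class` code): unset `GREP_OPTIONS` before the `grep` calls that parse the assembly jar name.
      
      ```shell
      # Unset GREP_OPTIONS so user-level options such as -n cannot change the
      # output format of the grep calls below (simplified illustration).
      unset GREP_OPTIONS
      
      # With options cleared, the jar-matching grep yields the bare filename.
      printf 'NOTICE\nspark-assembly-1.5.1-hadoop2.4.0.jar\n' \
        | grep -E '^spark-assembly.*\.jar$'
      ```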
      
      Author: Jeffrey Naisbitt <jnaisbitt@familysearch.org>
      
      Closes #9231 from naisbitt/unset-GREP_OPTIONS.
      28132ceb
    • dima's avatar
      [SPARK-11245] update twitter4j to 4.0.4 version · e5bc8c27
      dima authored
      Update twitter4j to version 4.0.4.
      https://issues.apache.org/jira/browse/SPARK-11245
      
      Author: dima <pronix.service@gmail.com>
      
      Closes #9221 from pronix/twitter4j_update.
      e5bc8c27
    • Jeff Zhang's avatar
      [SPARK-11125] [SQL] Uninformative exception when running spark-sql witho… · ffed0049
      Jeff Zhang authored
      …ut building with -Phive-thriftserver and SPARK_PREPEND_CLASSES is set
      
      This is the exception after this patch. Please help review.
      ```
      java.lang.NoClassDefFoundError: org/apache/hadoop/hive/cli/CliDriver
      	at java.lang.ClassLoader.defineClass1(Native Method)
      	at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
      	at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
      	at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
      	at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
      	at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
      	at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
      	at java.security.AccessController.doPrivileged(Native Method)
      	at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
      	at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
      	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
      	at java.lang.ClassLoader.loadClass(ClassLoader.java:412)
      	at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
      	at java.lang.Class.forName0(Native Method)
      	at java.lang.Class.forName(Class.java:270)
      	at org.apache.spark.util.Utils$.classForName(Utils.scala:173)
      	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:647)
      	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
      	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
      	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
      	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
      Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hive.cli.CliDriver
      	at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
      	at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
      	at java.security.AccessController.doPrivileged(Native Method)
      	at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
      	at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
      	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
      	at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
      	... 21 more
      Failed to load hive class.
      You need to build Spark with -Phive and -Phive-thriftserver.
      ```
      
      Author: Jeff Zhang <zjffdu@apache.org>
      
      Closes #9134 from zjffdu/SPARK-11125.
      ffed0049
  4. Oct 23, 2015
  5. Oct 22, 2015
    • zsxwing's avatar
      [SPARK-11098][CORE] Add Outbox to cache the sending messages to resolve the message disorder issue · a88c66ca
      zsxwing authored
      The current NettyRpc has a message order issue because it uses a thread pool to send messages. E.g., running the following two lines in the same thread,
      
      ```
      ref.send("A")
      ref.send("B")
      ```
      
      The remote endpoint may see "B" before "A" because "A" and "B" are sent in parallel.
      To resolve this issue, this PR adds an outbox for each connection: if we are still connecting to the remote node when a message is sent, the message is cached in the outbox, and the cached messages are sent one by one once the connection is established.
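      
      The buffering idea can be sketched minimally as below. Names and structure are illustrative only, not Spark's actual `Outbox` internals.
      
      ```scala
      import scala.collection.mutable
      
      // Per-connection outbox sketch: buffer messages until the connection is
      // up, then drain them in FIFO order so the send order is preserved.
      class Outbox(send: String => Unit) {
        private val pending = mutable.Queue.empty[String]
        private var connected = false
      
        def post(msg: String): Unit = synchronized {
          if (connected) send(msg) else pending.enqueue(msg)
        }
      
        def onConnected(): Unit = synchronized {
          connected = true
          while (pending.nonEmpty) send(pending.dequeue()) // drains in order
        }
      }
      ```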
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #9197 from zsxwing/rpc-outbox.
      a88c66ca
    • Andrew Or's avatar
      [SPARK-11251] Fix page size calculation in local mode · 34e71c6d
      Andrew Or authored
      ```
      // My machine only has 8 cores
      $ bin/spark-shell --master local[32]
      scala> val df = sc.parallelize(Seq((1, 1), (2, 2))).toDF("a", "b")
      scala> df.as("x").join(df.as("y"), $"x.a" === $"y.a").count()
      
      Caused by: java.io.IOException: Unable to acquire 2097152 bytes of memory
      	at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:351)
      ```
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #9209 from andrewor14/fix-local-page-size.
      34e71c6d
    • Gábor Lipták's avatar
      [SPARK-7021] Add JUnit output for Python unit tests · 163d53e8
      Gábor Lipták authored
      WIP
      
      Author: Gábor Lipták <gliptak@gmail.com>
      
      Closes #8323 from gliptak/SPARK-7021.
      163d53e8
    • Michael Armbrust's avatar
      [SPARK-11116][SQL] First Draft of Dataset API · 53e83a3a
      Michael Armbrust authored
      *This PR adds a new experimental API to Spark, tentatively named Datasets.*
      
      A `Dataset` is a strongly-typed collection of objects that can be transformed in parallel using functional or relational operations.  Example usage is as follows:
      
      ### Functional
      ```scala
      > val ds: Dataset[Int] = Seq(1, 2, 3).toDS()
      > ds.filter(_ % 1 == 0).collect()
      res1: Array[Int] = Array(1, 2, 3)
      ```
      
      ### Relational
      ```scala
      scala> ds.toDF().show()
      +-----+
      |value|
      +-----+
      |    1|
      |    2|
      |    3|
      +-----+
      
      > ds.select(expr("value + 1").as[Int]).collect()
      res11: Array[Int] = Array(2, 3, 4)
      ```
      
      ## Comparison to RDDs
       A `Dataset` differs from an `RDD` in the following ways:
        - The creation of a `Dataset` requires the presence of an explicit `Encoder` that can be
          used to serialize the object into a binary format.  Encoders are also capable of mapping the
          schema of a given object to the Spark SQL type system.  In contrast, RDDs rely on runtime
          reflection based serialization.
        - Internally, a `Dataset` is represented by a Catalyst logical plan and the data is stored
          in the encoded form.  This representation allows for additional logical operations and
          enables many operations (sorting, shuffling, etc.) to be performed without deserializing to
          an object.
      
      A `Dataset` can be converted to an `RDD` by calling the `.rdd` method.
      
      ## Comparison to DataFrames
      
      A `Dataset` can be thought of as a specialized DataFrame, where the elements map to a specific
      JVM object type, instead of to a generic `Row` container. A DataFrame can be transformed into
      a specific `Dataset` by calling `df.as[ElementType]`.  Similarly, you can transform a strongly-typed
      `Dataset` to a generic DataFrame by calling `ds.toDF()`.
      
      ## Implementation Status and TODOs
      
      This is a rough cut at the least controversial parts of the API.  The primary purpose here is to get something committed so that we can better parallelize further work and get early feedback on the API.  The following is being deferred to future PRs:
       - Joins and Aggregations (prototype here https://github.com/apache/spark/commit/f11f91e6f08c8cf389b8388b626cd29eec32d937)
       - Support for Java
      
      Additionally, the responsibility for binding an encoder to a given schema is currently done in a fairly ad-hoc fashion.  This is an internal detail, and what we are doing today works for the cases we care about.  However, as we add more APIs we'll probably need to do this in a more principled way (i.e. separate resolution from binding as we do in DataFrames).
      
      ## COMPATIBILITY NOTE
      Long term we plan to make `DataFrame` extend `Dataset[Row]`.  However,
      making this change to the class hierarchy would break the function signatures for the existing
      function operations (map, flatMap, etc).  As such, this class should be considered a preview
      of the final API.  Changes will be made to the interface after Spark 1.6.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #9190 from marmbrus/dataset-infra.
      53e83a3a
    • guoxi's avatar
      [SPARK-11242][SQL] In conf/spark-env.sh.template SPARK_DRIVER_MEMORY is documented incorrectly · 188ea348
      guoxi authored
      Minor fix on the comment
      
      Author: guoxi <guoxi@us.ibm.com>
      
      Closes #9201 from xguo27/SPARK-11242.
      188ea348
    • Cheng Hao's avatar
      [SPARK-9735][SQL] Respect the user specified schema than the infer partition... · d4950e6b
      Cheng Hao authored
      [SPARK-9735][SQL] Respect the user-specified schema over the inferred partition schema for HadoopFsRelation
      
      This enables the unit test `hadoopFsRelationSuite.Partition column type casting`. It previously threw the exception below, because the auto-inferred partition schema was treated with higher priority than the user-specified one.
      
      ```
      java.lang.ClassCastException: java.lang.Integer cannot be cast to org.apache.spark.unsafe.types.UTF8String
      	at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getUTF8String(rows.scala:45)
      	at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getUTF8String(rows.scala:220)
      	at org.apache.spark.sql.catalyst.expressions.JoinedRow.getUTF8String(JoinedRow.scala:102)
      	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(generated.java:62)
      	at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$17$$anonfun$apply$9.apply(DataSourceStrategy.scala:212)
      	at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$17$$anonfun$apply$9.apply(DataSourceStrategy.scala:212)
      	at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
      	at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
      	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
      	at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
      	at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
      	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
      	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
      	at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
      	at scala.collection.AbstractIterator.to(Iterator.scala:1157)
      	at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
      	at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
      	at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
      	at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
      	at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:903)
      	at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:903)
      	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1846)
      	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1846)
      	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
      	at org.apache.spark.scheduler.Task.run(Task.scala:88)
      	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
      	at java.lang.Thread.run(Thread.java:745)
      07:44:01.344 ERROR org.apache.spark.executor.Executor: Exception in task 14.0 in stage 3.0 (TID 206)
      java.lang.ClassCastException: java.lang.Integer cannot be cast to org.apache.spark.unsafe.types.UTF8String
      	at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getUTF8String(rows.scala:45)
      	at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getUTF8String(rows.scala:220)
      	at org.apache.spark.sql.catalyst.expressions.JoinedRow.getUTF8String(JoinedRow.scala:102)
      	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(generated.java:62)
      	at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$17$$anonfun$apply$9.apply(DataSourceStrategy.scala:212)
      	at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$17$$anonfun$apply$9.apply(DataSourceStrategy.scala:212)
      	at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
      	at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
      	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
      	at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
      	at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
      	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
      	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
      	at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
      	at scala.collection.AbstractIterator.to(Iterator.scala:1157)
      	at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
      	at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
      	at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
      	at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
      	at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:903)
      	at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:903)
      	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1846)
      	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1846)
      	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
      	at org.apache.spark.scheduler.Task.run(Task.scala:88)
      	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
      	at java.lang.Thread.run(Thread.java:745)
      ```
      
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #8026 from chenghao-intel/partition_discovery.
      d4950e6b
    • Kay Ousterhout's avatar
      [SPARK-11163] Remove unnecessary addPendingTask calls. · 3535b91d
      Kay Ousterhout authored
      This commit removes unnecessary calls to addPendingTask in
      TaskSetManager.executorLost. These calls are unnecessary: for
      tasks that are still pending and haven't been launched, they're
      still in all of the correct pending lists, so calling addPendingTask
      has no effect. For tasks that are currently running (which may still be
      in the pending lists, depending on how they were scheduled), we call
      addPendingTask in handleFailedTask, so the calls at the beginning
      of executorLost are redundant.
      
      I think these calls are left over from when we re-computed the locality
      levels in addPendingTask; now that we call recomputeLocality separately,
      I don't think these are necessary.
      
      Now that those calls are removed, the readding parameter in addPendingTask
      is no longer necessary, so this commit also removes that parameter.
      
      markhamstra can you take a look at this?
      
      cc vanzin
      
      Author: Kay Ousterhout <kayousterhout@gmail.com>
      
      Closes #9154 from kayousterhout/SPARK-11163.
      3535b91d