Skip to content
Snippets Groups Projects
  1. Jul 31, 2014
    • Prashant Sharma's avatar
      [SPARK-2497] Included checks for module symbols too. · 5a110da2
      Prashant Sharma authored
      Author: Prashant Sharma <prashant.s@imaginea.com>
      
      Closes #1463 from ScrapCodes/SPARK-2497/mima-exclude-all and squashes the following commits:
      
      72077b1 [Prashant Sharma] Check separately for module symbols.
      cd96192 [Prashant Sharma] SPARK-2497 Produce "member excludes" irrespective of the fact that class itself is excluded or not.
      5a110da2
    • Josh Rosen's avatar
      [SPARK-2737] Add retag() method for changing RDDs' ClassTags. · 4fb25935
      Josh Rosen authored
      The Java API's use of fake ClassTags doesn't seem to cause any problems for Java users, but it can lead to issues when passing JavaRDDs' underlying RDDs to Scala code (e.g. in the MLlib Java API wrapper code). If we call collect() on a Scala RDD with an incorrect ClassTag, this causes ClassCastExceptions when we try to allocate an array of the wrong type (for example, see SPARK-2197).
      
      There are a few possible fixes here. An API-breaking fix would be to completely remove the fake ClassTags and require Java API users to pass java.lang.Class instances to all parallelize() calls and add returnClass fields to all Function implementations. This would be extremely verbose.
      
      Instead, this patch adds internal APIs to "repair" a Scala RDD with an incorrect ClassTag by wrapping it and overriding its ClassTag. This should be okay for cases where the Scala code that calls collect() knows what type of array should be allocated, which is the case in the MLlib wrappers.
      
      Author: Josh Rosen <joshrosen@apache.org>
      
      Closes #1639 from JoshRosen/SPARK-2737 and squashes the following commits:
      
      572b4c8 [Josh Rosen] Replace newRDD[T] with mapPartitions().
      469d941 [Josh Rosen] Preserve partitioner in retag().
      af78816 [Josh Rosen] Allow retag() to get classTag implicitly.
      d1d54e6 [Josh Rosen] [SPARK-2737] Add retag() method for changing RDDs' ClassTags.
      4fb25935
  2. Jul 30, 2014
    • Andrew Or's avatar
      [SPARK-2340] Resolve event logging and History Server paths properly · a7c305b8
      Andrew Or authored
      We resolve relative paths to the local `file:/` system for `--jars` and `--files` in spark submit (#853). We should do the same for the history server.
      
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #1280 from andrewor14/hist-serv-fix and squashes the following commits:
      
      13ff406 [Andrew Or] Merge branch 'master' of github.com:apache/spark into hist-serv-fix
      b393e17 [Andrew Or] Strip trailing "/" from logging directory
      622a471 [Andrew Or] Fix test in EventLoggingListenerSuite
      0e20f71 [Andrew Or] Shift responsibility of resolving paths up one level
      b037c0c [Andrew Or] Use resolved paths for everything in history server
      c7e36ee [Andrew Or] Resolve paths for event logging too
      40e3933 [Andrew Or] Resolve history server file paths
      a7c305b8
    • derek ma's avatar
      Required AM memory is "amMem", not "args.amMemory" · 118c1c42
      derek ma authored
      "ERROR yarn.Client: Required AM memory (1024) is above the max threshold (1048) of this cluster" appears if this code is not changed. obviously, 1024 is less than 1048, so change this
      
      Author: derek ma <maji3@asiainfo-linkage.com>
      
      Closes #1494 from maji2014/master and squashes the following commits:
      
      b0f6640 [derek ma] Required AM memory is "amMem", not "args.amMemory"
      118c1c42
    • Reynold Xin's avatar
      [SPARK-2758] UnionRDD's UnionPartition should not reference parent RDDs · 894d48ff
      Reynold Xin authored
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #1675 from rxin/unionrdd and squashes the following commits:
      
      941d316 [Reynold Xin] Clear RDDs for checkpointing.
      c9f05f2 [Reynold Xin] [SPARK-2758] UnionRDD's UnionPartition should not reference parent RDDs
      894d48ff
    • Matei Zaharia's avatar
      SPARK-2045 Sort-based shuffle · e9662844
      Matei Zaharia authored
      This adds a new ShuffleManager based on sorting, as described in https://issues.apache.org/jira/browse/SPARK-2045. The bulk of the code is in an ExternalSorter class that is similar to ExternalAppendOnlyMap, but sorts key-value pairs by partition ID and can be used to create a single sorted file with a map task's output. (Longer-term I think this can take on the remaining functionality in ExternalAppendOnlyMap and replace it so we don't have code duplication.)
      
      The main TODOs still left are:
      - [x] enabling ExternalSorter to merge across spilled files
        - [x] with an Ordering
        - [x] without an Ordering, using the keys' hash codes
      - [x] adding more tests (e.g. a version of our shuffle suite that runs on this)
      - [x] rebasing on top of the size-tracking refactoring in #1165 when that is merged
      - [x] disabling spilling if spark.shuffle.spill is set to false
      
      Despite this though, this seems to work pretty well (running successfully in cases where the hash shuffle would OOM, such as 1000 reduce tasks on executors with only 1G memory), and it seems to be comparable in speed or faster than hash-based shuffle (it will create much fewer files for the OS to keep track of). So I'm posting it to get some early feedback.
      
      After these TODOs are done, I'd also like to enable ExternalSorter to sort data within each partition by a key as well, which will allow us to use it to implement external spilling in reduce tasks in `sortByKey`.
      
      Author: Matei Zaharia <matei@databricks.com>
      
      Closes #1499 from mateiz/sort-based-shuffle and squashes the following commits:
      
      bd841f9 [Matei Zaharia] Various review comments
      d1c137fd [Matei Zaharia] Various review comments
      a611159 [Matei Zaharia] Compile fixes due to rebase
      62c56c8 [Matei Zaharia] Fix ShuffledRDD sometimes not returning Tuple2s.
      f617432 [Matei Zaharia] Fix a failing test (seems to be due to change in SizeTracker logic)
      9464d5f [Matei Zaharia] Simplify code and fix conflicts after latest rebase
      0174149 [Matei Zaharia] Add cleanup behavior and cleanup tests for sort-based shuffle
      eb4ee0d [Matei Zaharia] Remove customizable element type in ShuffledRDD
      fa2e8db [Matei Zaharia] Allow nextBatchStream to be called after we're done looking at all streams
      a34b352 [Matei Zaharia] Fix tracking of indices within a partition in SpillReader, and add test
      03e1006 [Matei Zaharia] Add a SortShuffleSuite that runs ShuffleSuite with sort-based shuffle
      3c7ff1f [Matei Zaharia] Obey the spark.shuffle.spill setting in ExternalSorter
      ad65fbd [Matei Zaharia] Rebase on top of Aaron's Sorter change, and use Sorter in our buffer
      44d2a93 [Matei Zaharia] Use estimateSize instead of atGrowThreshold to test collection sizes
      5686f71 [Matei Zaharia] Optimize merging phase for in-memory only data:
      5461cbb [Matei Zaharia] Review comments and more tests (e.g. tests with 1 element per partition)
      e9ad356 [Matei Zaharia] Update ContextCleanerSuite to make sure shuffle cleanup tests use hash shuffle (since they were written for it)
      c72362a [Matei Zaharia] Added bug fix and test for when iterators are empty
      de1fb40 [Matei Zaharia] Make trait SizeTrackingCollection private[spark]
      4988d16 [Matei Zaharia] tweak
      c1b7572 [Matei Zaharia] Small optimization
      ba7db7f [Matei Zaharia] Handle null keys in hash-based comparator, and add tests for collisions
      ef4e397 [Matei Zaharia] Support for partial aggregation even without an Ordering
      4b7a5ce [Matei Zaharia] More tests, and ability to sort data if a total ordering is given
      e1f84be [Matei Zaharia] Fix disk block manager test
      5a40a1c [Matei Zaharia] More tests
      614f1b4 [Matei Zaharia] Add spill metrics to map tasks
      cc52caf [Matei Zaharia] Add more error handling and tests for error cases
      bbf359d [Matei Zaharia] More work
      3a56341 [Matei Zaharia] More partial work towards sort-based shuffle
      7a0895d [Matei Zaharia] Some more partial work towards sort-based shuffle
      b615476 [Matei Zaharia] Scaffolding for sort-based shuffle
      e9662844
    • strat0sphere's avatar
      Update DecisionTreeRunner.scala · da501766
      strat0sphere authored
      Author: strat0sphere <stratos.dimopoulos@gmail.com>
      
      Closes #1676 from strat0sphere/patch-1 and squashes the following commits:
      
      044d2fa [strat0sphere] Update DecisionTreeRunner.scala
      da501766
    • Sean Owen's avatar
      SPARK-2341 [MLLIB] loadLibSVMFile doesn't handle regression datasets · e9b275b7
      Sean Owen authored
      Per discussion at https://issues.apache.org/jira/browse/SPARK-2341 , this is a look at deprecating the multiclass parameter. Thoughts welcome of course.
      
      Author: Sean Owen <srowen@gmail.com>
      
      Closes #1663 from srowen/SPARK-2341 and squashes the following commits:
      
      8a3abd7 [Sean Owen] Suppress MIMA error for removed package private classes
      18a8c8e [Sean Owen] Updates from review
      83d0092 [Sean Owen] Deprecated methods with multiclass, and instead always parse target as a double (ie. multiclass = true)
      e9b275b7
    • Michael Armbrust's avatar
      [SPARK-2734][SQL] Remove tables from cache when DROP TABLE is run. · 88a519db
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #1650 from marmbrus/dropCached and squashes the following commits:
      
      e6ab80b [Michael Armbrust] Support if exists.
      83426c6 [Michael Armbrust] Remove tables from cache when DROP TABLE is run.
      88a519db
    • Brock Noland's avatar
      SPARK-2741 - Publish version of spark assembly which does not contain Hive · 2ac37db7
      Brock Noland authored
      Provide a version of the Spark tarball which does not package Hive. This is meant for HIve + Spark users.
      
      Author: Brock Noland <brock@apache.org>
      
      Closes #1667 from brockn/master and squashes the following commits:
      
      5beafb2 [Brock Noland] SPARK-2741 - Publish version of spark assembly which does not contain Hive
      2ac37db7
    • Sean Owen's avatar
      SPARK-2749 [BUILD]. Spark SQL Java tests aren't compiling in Jenkins' Maven... · 6ab96a6f
      Sean Owen authored
      SPARK-2749 [BUILD]. Spark SQL Java tests aren't compiling in Jenkins' Maven builds; missing junit:junit dep
      
      The Maven-based builds in the build matrix have been failing for a few days:
      
      https://amplab.cs.berkeley.edu/jenkins/view/Spark/
      
      On inspection, it looks like the Spark SQL Java tests don't compile:
      
      https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-Master-Maven-pre-YARN/hadoop.version=1.0.4,label=centos/244/consoleFull
      
      I confirmed it by repeating the command vs master:
      
      `mvn -Dhadoop.version=1.0.4 -Dlabel=centos -DskipTests clean package`
      
      The problem is that this module doesn't depend on JUnit. In fact, none of the modules do, but `com.novocode:junit-interface` (the SBT-JUnit bridge) pulls it in, in most places. However this module doesn't depend on `com.novocode:junit-interface`
      
      Adding the `junit:junit` dependency fixes the compile problem. In fact, the other modules with Java tests should probably depend on it explicitly instead of happening to get it via `com.novocode:junit-interface`, since that is a bit SBT/Scala-specific (and I am not even sure it's needed).
      
      Author: Sean Owen <srowen@gmail.com>
      
      Closes #1660 from srowen/SPARK-2749 and squashes the following commits:
      
      858ff7c [Sean Owen] Add explicit junit dep to other modules with Java tests for robustness
      9636794 [Sean Owen] Add junit dep so that Spark SQL Java tests compile
      6ab96a6f
    • Reynold Xin's avatar
      Properly pass SBT_MAVEN_PROFILES into sbt. · 2f4b1705
      Reynold Xin authored
      2f4b1705
    • Reynold Xin's avatar
      Set AMPLAB_JENKINS_BUILD_PROFILE. · 10973275
      Reynold Xin authored
      10973275
    • Reynold Xin's avatar
      Wrap JAR_DL in dev/check-license. · 7c7ce545
      Reynold Xin authored
      7c7ce545
    • Kan Zhang's avatar
      [SPARK-2024] Add saveAsSequenceFile to PySpark · 94d1f46f
      Kan Zhang authored
      JIRA issue: https://issues.apache.org/jira/browse/SPARK-2024
      
      This PR is a followup to #455 and adds capabilities for saving PySpark RDDs using SequenceFile or any Hadoop OutputFormats.
      
      * Added RDD methods ```saveAsSequenceFile```, ```saveAsHadoopFile``` and ```saveAsHadoopDataset```, for both old and new MapReduce APIs.
      
      * Default converter for converting common data types to Writables. Users may specify custom converters to convert to desired data types.
      
      * No out-of-box support for reading/writing arrays, since ArrayWritable itself doesn't have a no-arg constructor for creating an empty instance upon reading. Users need to provide ArrayWritable subtypes. Custom converters for converting arrays to suitable ArrayWritable subtypes are also needed when writing. When reading, the default converter will convert any custom ArrayWritable subtypes to ```Object[]``` and they get pickled to Python tuples.
      
      * Added HBase and Cassandra output examples to show how custom output formats and converters can be used.
      
      cc MLnick mateiz ahirreddy pwendell
      
      Author: Kan Zhang <kzhang@apache.org>
      
      Closes #1338 from kanzhang/SPARK-2024 and squashes the following commits:
      
      c01e3ef [Kan Zhang] [SPARK-2024] code formatting
      6591e37 [Kan Zhang] [SPARK-2024] renaming pickled -> pickledRDD
      d998ad6 [Kan Zhang] [SPARK-2024] refectoring to get method params below 10
      57a7a5e [Kan Zhang] [SPARK-2024] correcting typo
      75ca5bd [Kan Zhang] [SPARK-2024] Better type checking for batch serialized RDD
      0bdec55 [Kan Zhang] [SPARK-2024] Refactoring newly added tests
      9f39ff4 [Kan Zhang] [SPARK-2024] Adding 2 saveAsHadoopDataset tests
      0c134f3 [Kan Zhang] [SPARK-2024] Test refactoring and adding couple unbatched cases
      7a176df [Kan Zhang] [SPARK-2024] Add saveAsSequenceFile to PySpark
      94d1f46f
    • Reynold Xin's avatar
      dev/check-license wrap folders in quotes. · 437dc8c5
      Reynold Xin authored
      437dc8c5
    • Michael Armbrust's avatar
      [SQL] Fix compiling of catalyst docs. · 2248891a
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #1653 from marmbrus/fixDocs and squashes the following commits:
      
      0aa1feb [Michael Armbrust] Fix compiling of catalyst docs.
      2248891a
    • Reynold Xin's avatar
      More wrapping FWDIR in quotes. · 0feb349e
      Reynold Xin authored
      0feb349e
    • Reynold Xin's avatar
      Wrap FWDIR in quotes in dev/check-license. · 95cf2039
      Reynold Xin authored
      95cf2039
    • Reynold Xin's avatar
      Wrap FWDIR in quotes. · f2eb84fe
      Reynold Xin authored
      f2eb84fe
    • Reynold Xin's avatar
      [SPARK-2746] Set SBT_MAVEN_PROFILES only when it is not set explicitly by the user. · ff511bac
      Reynold Xin authored
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #1655 from rxin/SBT_MAVEN_PROFILES and squashes the following commits:
      
      b268c4b [Reynold Xin] [SPARK-2746] Set SBT_MAVEN_PROFILES only when it is not set explicitly by the user.
      ff511bac
    • GuoQiang Li's avatar
      [SPARK-2544][MLLIB] Improve ALS algorithm resource usage · fc47bb69
      GuoQiang Li authored
      Author: GuoQiang Li <witgo@qq.com>
      Author: witgo <witgo@qq.com>
      
      Closes #929 from witgo/improve_als and squashes the following commits:
      
      ea25033 [GuoQiang Li] checkpoint products 3,6,9 ...
      154dccf [GuoQiang Li] checkpoint products only
      c5779ff [witgo] Improve ALS algorithm resource usage
      fc47bb69
    • Naftali Harris's avatar
      Avoid numerical instability · e3d85b7e
      Naftali Harris authored
      This avoids basically doing 1 - 1, for example:
      
      ```python
      >>> from math import exp
      >>> margin = -40
      >>> 1 - 1 / (1 + exp(margin))
      0.0
      >>> exp(margin) / (1 + exp(margin))
      4.248354255291589e-18
      >>>
      ```
      
      Author: Naftali Harris <naftaliharris@gmail.com>
      
      Closes #1652 from naftaliharris/patch-2 and squashes the following commits:
      
      0d55a9f [Naftali Harris] Avoid numerical instability
      e3d85b7e
    • Reynold Xin's avatar
      [SPARK-2747] git diff --dirstat can miss sql changes and not run Hive tests · 3bc3f180
      Reynold Xin authored
      dev/run-tests use "git diff --dirstat master" to check whether sql is changed. However, --dirstat won't show sql if sql's change is negligible (e.g. 1k loc change in core, and only 1 loc change in hive).
      
      We should use "git diff --name-only master" instead.
      
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #1656 from rxin/hiveTest and squashes the following commits:
      
      f5eab9f [Reynold Xin] [SPARK-2747] git diff --dirstat can miss sql changes and not run Hive tests.
      3bc3f180
    • Reynold Xin's avatar
      [SPARK-2521] Broadcast RDD object (instead of sending it along with every task) · 774142f5
      Reynold Xin authored
      This is a resubmission of #1452. It was reverted because it broke the build.
      
      Currently (as of Spark 1.0.1), Spark sends RDD object (which contains closures) using Akka along with the task itself to the executors. This is inefficient because all tasks in the same stage use the same RDD object, but we have to send RDD object multiple times to the executors. This is especially bad when a closure references some variable that is very large. The current design led to users having to explicitly broadcast large variables.
      
      The patch uses broadcast to send RDD objects and the closures to executors, and use Akka to only send a reference to the broadcast RDD/closure along with the partition specific information for the task. For those of you who know more about the internals, Spark already relies on broadcast to send the Hadoop JobConf every time it uses the Hadoop input, because the JobConf is large.
      
      The user-facing impact of the change include:
      
      1. Users won't need to decide what to broadcast anymore, unless they would want to use a large object multiple times in different operations
      2. Task size will get smaller, resulting in faster scheduling and higher task dispatch throughput.
      
      In addition, the change will simplify some internals of Spark, eliminating the need to maintain task caches and the complex logic to broadcast JobConf (which also led to a deadlock recently).
      
      A simple way to test this:
      ```scala
      val a = new Array[Byte](1000*1000); scala.util.Random.nextBytes(a);
      sc.parallelize(1 to 1000, 1000).map { x => a; x }.groupBy { x => a; x }.count
      ```
      
      Numbers on 3 r3.8xlarge instances on EC2
      ```
      master branch: 5.648436068 s, 4.715361895 s, 5.360161877 s
      with this change: 3.416348793 s, 1.477846558 s, 1.553432156 s
      ```
      
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #1498 from rxin/broadcast-task and squashes the following commits:
      
      f7364db [Reynold Xin] Code review feedback.
      f8535dc [Reynold Xin] Fixed the style violation.
      252238d [Reynold Xin] Serialize the final task closure as well as ShuffleDependency in taskBinary.
      111007d [Reynold Xin] Fix broadcast tests.
      797c247 [Reynold Xin] Properly send SparkListenerStageSubmitted and SparkListenerStageCompleted.
      bab1d8b [Reynold Xin] Check for NotSerializableException in submitMissingTasks.
      cf38450 [Reynold Xin] Use TorrentBroadcastFactory.
      991c002 [Reynold Xin] Use HttpBroadcast.
      de779f8 [Reynold Xin] Fix TaskContextSuite.
      cc152fc [Reynold Xin] Don't cache the RDD broadcast variable.
      d256b45 [Reynold Xin] Fixed unit test failures. One more to go.
      cae0af3 [Reynold Xin] [SPARK-2521] Broadcast RDD object (instead of sending it along with every task).
      774142f5
    • Sean Owen's avatar
      SPARK-2748 [MLLIB] [GRAPHX] Loss of precision for small arguments to Math.exp, Math.log · ee07541e
      Sean Owen authored
      In a few places in MLlib, an expression of the form `log(1.0 + p)` is evaluated. When p is so small that `1.0 + p == 1.0`, the result is 0.0. However the correct answer is very near `p`. This is why `Math.log1p` exists.
      
      Similarly for one instance of `exp(m) - 1` in GraphX; there's a special `Math.expm1` method.
      
      While the errors occur only for very small arguments, given their use in machine learning algorithms, this is entirely possible.
      
      Also note the related PR for Python: https://github.com/apache/spark/pull/1652
      
      Author: Sean Owen <srowen@gmail.com>
      
      Closes #1659 from srowen/SPARK-2748 and squashes the following commits:
      
      c5926d4 [Sean Owen] Use log1p, expm1 for better precision for tiny arguments
      ee07541e
    • Koert Kuipers's avatar
      SPARK-2543: Allow user to set maximum Kryo buffer size · 7c5fc28a
      Koert Kuipers authored
      Author: Koert Kuipers <koert@tresata.com>
      
      Closes #735 from koertkuipers/feat-kryo-max-buffersize and squashes the following commits:
      
      15f6d81 [Koert Kuipers] change default for spark.kryoserializer.buffer.max.mb to 64mb and add some documentation
      1bcc22c [Koert Kuipers] Merge branch 'master' into feat-kryo-max-buffersize
      0c9f8eb [Koert Kuipers] make default for kryo max buffer size 16MB
      143ec4d [Koert Kuipers] test resizable buffer in kryo Output
      0732445 [Koert Kuipers] support setting maxCapacity to something different than capacity in kryo Output
      7c5fc28a
    • Yin Huai's avatar
      [SPARK-2179][SQL] Public API for DataTypes and Schema · 7003c163
      Yin Huai authored
      The current PR contains the following changes:
      * Expose `DataType`s in the sql package (internal details are private to sql).
      * Users can create Rows.
      * Introduce `applySchema` to create a `SchemaRDD` by applying a `schema: StructType` to an `RDD[Row]`.
      * Add a function `simpleString` to every `DataType`. Also, the schema represented by a `StructType` can be visualized by `printSchema`.
      * `ScalaReflection.typeOfObject` provides a way to infer the Catalyst data type based on an object. Also, we can compose `typeOfObject` with some custom logics to form a new function to infer the data type (for different use cases).
      * `JsonRDD` has been refactored to use changes introduced by this PR.
      * Add a field `containsNull` to `ArrayType`. So, we can explicitly mark if an `ArrayType` can contain null values. The default value of `containsNull` is `false`.
      
      New APIs are introduced in the sql package object and SQLContext. You can find the scaladoc at
      [sql package object](http://yhuai.github.io/site/api/scala/index.html#org.apache.spark.sql.package) and [SQLContext](http://yhuai.github.io/site/api/scala/index.html#org.apache.spark.sql.SQLContext).
      
      An example of using `applySchema` is shown below.
      ```scala
      import org.apache.spark.sql._
      val sqlContext = new org.apache.spark.sql.SQLContext(sc)
      
      val schema =
        StructType(
          StructField("name", StringType, false) ::
          StructField("age", IntegerType, true) :: Nil)
      
      val people = sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p => Row(p(0), p(1).trim.toInt))
      val peopleSchemaRDD = sqlContext. applySchema(people, schema)
      peopleSchemaRDD.printSchema
      // root
      // |-- name: string (nullable = false)
      // |-- age: integer (nullable = true)
      
      peopleSchemaRDD.registerAsTable("people")
      sqlContext.sql("select name from people").collect.foreach(println)
      ```
      
      I will add new contents to the SQL programming guide later.
      
      JIRA: https://issues.apache.org/jira/browse/SPARK-2179
      
      Author: Yin Huai <huai@cse.ohio-state.edu>
      
      Closes #1346 from yhuai/dataTypeAndSchema and squashes the following commits:
      
      1d45977 [Yin Huai] Clean up.
      a6e08b4 [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
      c712fbf [Yin Huai] Converts types of values based on defined schema.
      4ceeb66 [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
      e5f8df5 [Yin Huai] Scaladoc.
      122d1e7 [Yin Huai] Address comments.
      03bfd95 [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
      2476ed0 [Yin Huai] Minor updates.
      ab71f21 [Yin Huai] Format.
      fc2bed1 [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
      bd40a33 [Yin Huai] Address comments.
      991f860 [Yin Huai] Move "asJavaDataType" and "asScalaDataType" to DataTypeConversions.scala.
      1cb35fe [Yin Huai] Add "valueContainsNull" to MapType.
      3edb3ae [Yin Huai] Python doc.
      692c0b9 [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
      1d93395 [Yin Huai] Python APIs.
      246da96 [Yin Huai] Add java data type APIs to javadoc index.
      1db9531 [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
      d48fc7b [Yin Huai] Minor updates.
      33c4fec [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
      b9f3071 [Yin Huai] Java API for applySchema.
      1c9f33c [Yin Huai] Java APIs for DataTypes and Row.
      624765c [Yin Huai] Tests for applySchema.
      aa92e84 [Yin Huai] Update data type tests.
      8da1a17 [Yin Huai] Add Row.fromSeq.
      9c99bc0 [Yin Huai] Several minor updates.
      1d9c13a [Yin Huai] Update applySchema API.
      85e9b51 [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
      e495e4e [Yin Huai] More comments.
      42d47a3 [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
      c3f4a02 [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
      2e58dbd [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
      b8b7db4 [Yin Huai] 1. Move sql package object and package-info to sql-core. 2. Minor updates on APIs. 3. Update scala doc.
      68525a2 [Yin Huai] Update JSON unit test.
      3209108 [Yin Huai] Add unit tests.
      dcaf22f [Yin Huai] Add a field containsNull to ArrayType to indicate if an array can contain null values or not. If an ArrayType is constructed by "ArrayType(elementType)" (the existing constructor), the value of containsNull is false.
      9168b83 [Yin Huai] Update comments.
      fc649d7 [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
      eca7d04 [Yin Huai] Add two apply methods which will be used to extract StructField(s) from a StructType.
      949d6bb [Yin Huai] When creating a SchemaRDD for a JSON dataset, users can apply an existing schema.
      7a6a7e5 [Yin Huai] Fix bug introduced by the change made on SQLContext.inferSchema.
      43a45e1 [Yin Huai] Remove sql.util.package introduced in a previous commit.
      0266761 [Yin Huai] Format
      03eec4c [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
      90460ac [Yin Huai] Infer the Catalyst data type from an object and cast a data value to the expected type.
      3fa0df5 [Yin Huai] Provide easier ways to construct a StructType.
      16be3e5 [Yin Huai] This commit contains three changes: * Expose `DataType`s in the sql package (internal details are private to sql). * Introduce `createSchemaRDD` to create a `SchemaRDD` from an `RDD` with a provided schema (represented by a `StructType`) and a provided function to construct `Row`, * Add a function `simpleString` to every `DataType`. Also, the schema represented by a `StructType` can be visualized by `printSchema`.
      7003c163
    • Andrew Or's avatar
      [SPARK-2260] Fix standalone-cluster mode, which was broken · 4ce92cca
      Andrew Or authored
      The main thing was that spark configs were not propagated to the driver, and so applications that do not specify `master` or `appName` automatically failed. This PR fixes that and a couple of miscellaneous things that are related.
      
      One thing that may or may not be an issue is that the jars must be available on the driver node. In `standalone-cluster` mode, this effectively means these jars must be available on all the worker machines, since the driver is launched on one of them. The semantics here are not the same as `yarn-cluster` mode,  where all the relevant jars are uploaded to a distributed cache automatically and shipped to the containers. This is probably not a concern, but still worth a mention.
      
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #1538 from andrewor14/standalone-cluster and squashes the following commits:
      
      8c11a0d [Andrew Or] Clean up imports / comments (minor)
      2678d13 [Andrew Or] Handle extraJavaOpts properly
      7660547 [Andrew Or] Merge branch 'master' of github.com:apache/spark into standalone-cluster
      6f64a9b [Andrew Or] Revert changes in YARN
      2f2908b [Andrew Or] Fix tests
      ed01491 [Andrew Or] Don't go overboard with escaping
      8e105e1 [Andrew Or] Merge branch 'master' of github.com:apache/spark into standalone-cluster
      b890949 [Andrew Or] Abstract usages of converting spark opts to java opts
      79f63a3 [Andrew Or] Move sparkProps into javaOpts
      78752f8 [Andrew Or] Fix tests
      5a9c6c7 [Andrew Or] Fix line too long
      c141a00 [Andrew Or] Don't display "unknown app" on driver log pages
      d7e2728 [Andrew Or] Avoid deprecation warning in standalone Client
      6ceb14f [Andrew Or] Allow relevant configs to propagate to standalone Driver
      7f854bc [Andrew Or] Fix test
      855256e [Andrew Or] Fix standalone-cluster mode
      fd9da51 [Andrew Or] Formatting changes (minor)
      4ce92cca
    • Michael Armbrust's avatar
      [SQL] Handle null values in debug() · 077f633b
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #1646 from marmbrus/nullDebug and squashes the following commits:
      
      49050a8 [Michael Armbrust] Handle null values in debug()
      077f633b
    • Xiangrui Meng's avatar
      [SPARK-2568] RangePartitioner should run only one job if data is balanced · 2e6efcac
      Xiangrui Meng authored
      As of Spark 1.0, RangePartitioner goes through data twice: once to compute the count and once to do sampling. As a result, to do sortByKey, Spark goes through data 3 times (once to count, once to sample, and once to sort).
      
      `RangePartitioner` should go through data only once, collecting samples from input partitions as well as counting. If the data is balanced, this should give us a good sketch. If we see big partitions, we re-sample from them in order to collect enough items.
      
      The downside is that we need to collect more from each partition in the first pass. An alternative solution is caching the intermediate result and decide whether to fetch the data after.
      
      Author: Xiangrui Meng <meng@databricks.com>
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #1562 from mengxr/range-partitioner and squashes the following commits:
      
      6cc2551 [Xiangrui Meng] change foreach to for
      eb39b08 [Xiangrui Meng] Merge branch 'master' into range-partitioner
      eb95dd8 [Xiangrui Meng] separate sketching and determining bounds impl
      c436d30 [Xiangrui Meng] fix binary metrics unit tests
      db58a55 [Xiangrui Meng] add unit tests
      a6e35d6 [Xiangrui Meng] minor update
      60be09e [Xiangrui Meng] remove importance sampler
      9ee9992 [Xiangrui Meng] update range partitioner to run only one job on roughly balanced data
      cc12f47 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into range-part
      06ac2ec [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into range-part
      17bcbf3 [Reynold Xin] Added seed.
      badf20d [Reynold Xin] Renamed the method.
      6940010 [Reynold Xin] Reservoir sampling implementation.
      2e6efcac
  3. Jul 29, 2014
    • Michael Armbrust's avatar
      [SPARK-2054][SQL] Code Generation for Expression Evaluation · 84467468
      Michael Armbrust authored
      Adds a new method for evaluating expressions using code that is generated though Scala reflection.  This functionality is configured by the SQLConf option `spark.sql.codegen` and is currently turned off by default.
      
      Evaluation can be done in several specialized ways:
       - *Projection* - Given an input row, produce a new row from a set of expressions that define each column in terms of the input row.  This can either produce a new Row object or perform the projection in-place on an existing Row (MutableProjection).
       - *Ordering* - Compares two rows based on a list of `SortOrder` expressions
       - *Condition* - Returns `true` or `false` given an input row.
      
      For each of the above operations there is both a Generated and Interpreted version.  When generation for a given expression type is undefined, the code generator falls back on calling the `eval` function of the expression class.  Even without custom code, there is still a potential speed up, as loops are unrolled and code can still be inlined by JIT.
      
      This PR also contains a new type of Aggregation operator, `GeneratedAggregate`, that performs aggregation by using generated `Projection` code.  Currently the required expression rewriting only works for simple aggregations like `SUM` and `COUNT`.  This functionality will be extended in a future PR.
      
      This PR also performs several clean ups that simplified the implementation:
       - The notion of `Binding` all expressions in a tree automatically before query execution has been removed.  Instead it is the responsibly of an operator to provide the input schema when creating one of the specialized evaluators defined above.  In cases when the standard eval method is going to be called, binding can still be done manually using `BindReferences`.  There are a few reasons for this change:  First, there were many operators where it just didn't work before.  For example, operators with more than one child, and operators like aggregation that do significant rewriting of the expression. Second, the semantics of equality with `BoundReferences` are broken.  Specifically, we have had a few bugs where partitioning breaks because of the binding.
       - A copy of the current `SQLContext` is automatically propagated to all `SparkPlan` nodes by the query planner.  Before this was done ad-hoc for the nodes that needed this.  However, this required a lot of boilerplate as one had to always remember to make it `transient` and also had to modify the `otherCopyArgs`.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #993 from marmbrus/newCodeGen and squashes the following commits:
      
      96ef82c [Michael Armbrust] Merge remote-tracking branch 'apache/master' into newCodeGen
      f34122d [Michael Armbrust] Merge remote-tracking branch 'apache/master' into newCodeGen
      67b1c48 [Michael Armbrust] Use conf variable in SQLConf object
      4bdc42c [Michael Armbrust] Merge remote-tracking branch 'origin/master' into newCodeGen
      41a40c9 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into newCodeGen
      de22aac [Michael Armbrust] Merge remote-tracking branch 'origin/master' into newCodeGen
      fed3634 [Michael Armbrust] Inspectors are not serializable.
      ef8d42b [Michael Armbrust] comments
      533fdfd [Michael Armbrust] More logging of expression rewriting for GeneratedAggregate.
      3cd773e [Michael Armbrust] Allow codegen for Generate.
      64b2ee1 [Michael Armbrust] Implement copy
      3587460 [Michael Armbrust] Drop unused string builder function.
      9cce346 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into newCodeGen
      1a61293 [Michael Armbrust] Address review comments.
      0672e8a [Michael Armbrust] Address comments.
      1ec2d6e [Michael Armbrust] Address comments
      033abc6 [Michael Armbrust] off by default
      4771fab [Michael Armbrust] Docs, more test coverage.
      d30fee2 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into newCodeGen
      d2ad5c5 [Michael Armbrust] Refactor putting SQLContext into SparkPlan. Fix ordering, other test cases.
      be2cd6b [Michael Armbrust] WIP: Remove old method for reference binding, more work on configuration.
      bc88ecd [Michael Armbrust] Style
      6cc97ca [Michael Armbrust] Merge remote-tracking branch 'origin/master' into newCodeGen
      4220f1e [Michael Armbrust] Better config, docs, etc.
      ca6cc6b [Michael Armbrust] WIP
      9d67d85 [Michael Armbrust] Fix hive planner
      fc522d5 [Michael Armbrust] Hook generated aggregation in to the planner.
      e742640 [Michael Armbrust] Remove unneeded changes and code.
      675e679 [Michael Armbrust] Upgrade paradise.
      0093376 [Michael Armbrust] Comment / indenting cleanup.
      d81f998 [Michael Armbrust] include schema for binding.
      0e889e8 [Michael Armbrust] Use typeOf instead tq
      f623ffd [Michael Armbrust] Quiet logging from test suite.
      efad14f [Michael Armbrust] Remove some half finished functions.
      92e74a4 [Michael Armbrust] add overrides
      a2b5408 [Michael Armbrust] WIP: Code generation with scala reflection.
      84467468
    • Josh Rosen's avatar
      [SPARK-2305] [PySpark] Update Py4J to version 0.8.2.1 · 22649b6c
      Josh Rosen authored
      Author: Josh Rosen <joshrosen@apache.org>
      
      Closes #1626 from JoshRosen/SPARK-2305 and squashes the following commits:
      
      03fb283 [Josh Rosen] Update Py4J to version 0.8.2.1.
      22649b6c
    • Michael Armbrust's avatar
      [SPARK-2631][SQL] Use SQLConf to configure in-memory columnar caching · 86534d0f
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #1638 from marmbrus/cachedConfig and squashes the following commits:
      
      2362082 [Michael Armbrust] Use SQLConf to configure in-memory columnar caching
      86534d0f
    • Michael Armbrust's avatar
      [SPARK-2716][SQL] Don't check resolved for having filters. · 39b81931
      Michael Armbrust authored
      For queries like `... HAVING COUNT(*) > 9` the expression is always resolved since it contains no attributes.  This was causing us to avoid doing the Having clause aggregation rewrite.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #1640 from marmbrus/havingNoRef and squashes the following commits:
      
      92d3901 [Michael Armbrust] Don't check resolved for having filters.
      39b81931
    • Patrick Wendell's avatar
      MAINTENANCE: Automated closing of pull requests. · 2c356665
      Patrick Wendell authored
      This commit exists to close the following pull requests on Github:
      
      Closes #740 (close requested by 'rxin')
      Closes #647 (close requested by 'rxin')
      Closes #1383 (close requested by 'rxin')
      Closes #1485 (close requested by 'pwendell')
      Closes #693 (close requested by 'rxin')
      Closes #478 (close requested by 'JoshRosen')
      2c356665
    • Zongheng Yang's avatar
      [SPARK-2393][SQL] Cost estimation optimization framework for Catalyst logical plans & sample usage. · c7db274b
      Zongheng Yang authored
      The idea is that every Catalyst logical plan gets hold of a Statistics class, the usage of which provides useful estimations on various statistics. See the implementations of `MetastoreRelation`.
      
      This patch also includes several usages of the estimation interface in the planner. For instance, we now use physical table sizes from the estimate interface to convert an equi-join to a broadcast join (when doing so is beneficial, as determined by a size threshold).
      
      Finally, there are a couple minor accompanying changes including:
      - Remove the not-in-use `BaseRelation`.
      - Make SparkLogicalPlan take a `SQLContext` in the second param list.
      
      Author: Zongheng Yang <zongheng.y@gmail.com>
      
      Closes #1238 from concretevitamin/estimates and squashes the following commits:
      
      329071d [Zongheng Yang] Address review comments; turn config name from string to field in SQLConf.
      8663e84 [Zongheng Yang] Use BigInt for stat; for logical leaves, by default throw an exception.
      2f2fb89 [Zongheng Yang] Fix statistics for SparkLogicalPlan.
      9951305 [Zongheng Yang] Remove childrenStats.
      16fc60a [Zongheng Yang] Avoid calling statistics on plans if auto join conversion is disabled.
      8bd2816 [Zongheng Yang] Add a note on performance of statistics.
      6e594b8 [Zongheng Yang] Get size info from metastore for MetastoreRelation.
      01b7a3e [Zongheng Yang] Update scaladoc for a field and move it to @param section.
      549061c [Zongheng Yang] Remove numTuples in Statistics for now.
      729a8e2 [Zongheng Yang] Update docs to be more explicit.
      573e644 [Zongheng Yang] Remove singleton SQLConf and move back `settings` to the trait.
      2d99eb5 [Zongheng Yang] {Cleanup, use synchronized in, enrich} StatisticsSuite.
      ca5b825 [Zongheng Yang] Inject SQLContext into SparkLogicalPlan, removing SQLConf mixin from it.
      43d38a6 [Zongheng Yang] Revert optimization for BroadcastNestedLoopJoin (this fixes tests).
      0ef9e5b [Zongheng Yang] Use multiplication instead of sum for default estimates.
      4ef0d26 [Zongheng Yang] Make Statistics a case class.
      3ba8f3e [Zongheng Yang] Add comment.
      e5bcf5b [Zongheng Yang] Fix optimization conditions & update scala docs to explain.
      7d9216a [Zongheng Yang] Apply estimation to planning ShuffleHashJoin & BroadcastNestedLoopJoin.
      73cde01 [Zongheng Yang] Move SQLConf back. Assign default sizeInBytes to SparkLogicalPlan.
      73412be [Zongheng Yang] Move SQLConf to Catalyst & add default val for sizeInBytes.
      7a60ab7 [Zongheng Yang] s/Estimates/Statistics, s/cardinality/numTuples.
      de3ae13 [Zongheng Yang] Add parquetAfter() properly in test.
      dcff9bd [Zongheng Yang] Cleanups.
      84301a4 [Zongheng Yang] Refactors.
      5bf5586 [Zongheng Yang] Typo.
      56a8e6e [Zongheng Yang] Prototype impl of estimations for Catalyst logical plans.
      c7db274b
    • Doris Xin's avatar
      [SPARK-2082] stratified sampling in PairRDDFunctions that guarantees exact sample size · dc965364
      Doris Xin authored
      Implemented stratified sampling that guarantees exact sample size using ScaRSR with two passes over the RDD for sampling without replacement and three passes for sampling with replacement.
      
      Author: Doris Xin <doris.s.xin@gmail.com>
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #1025 from dorx/stratified and squashes the following commits:
      
      245439e [Doris Xin] moved minSamplingRate to getUpperBound
      eaf5771 [Doris Xin] bug fixes.
      17a381b [Doris Xin] fixed a merge issue and a failed unit
      ea7d27f [Doris Xin] merge master
      b223529 [Xiangrui Meng] use approx bounds for poisson fix poisson mean for waitlisting add unit tests for Java
      b3013a4 [Xiangrui Meng] move math3 back to test scope
      eecee5f [Doris Xin] Merge branch 'master' into stratified
      f4c21f3 [Doris Xin] Reviewer comments
      a10e68d [Doris Xin] style fix
      a2bf756 [Doris Xin] Merge branch 'master' into stratified
      680b677 [Doris Xin] use mapPartitionWithIndex instead
      9884a9f [Doris Xin] style fix
      bbfb8c9 [Doris Xin] Merge branch 'master' into stratified
      ee9d260 [Doris Xin] addressed reviewer comments
      6b5b10b [Doris Xin] Merge branch 'master' into stratified
      254e03c [Doris Xin] minor fixes and Java API.
      4ad516b [Doris Xin] remove unused imports from PairRDDFunctions
      bd9dc6e [Doris Xin] unit bug and style violation fixed
      1fe1cff [Doris Xin] Changed fractionByKey to a map to enable arg check
      944a10c [Doris Xin] [SPARK-2145] Add lower bound on sampling rate
      0214a76 [Doris Xin] cleanUp
      90d94c0 [Doris Xin] merge master
      9e74ab5 [Doris Xin] Separated out most of the logic in sampleByKey
      7327611 [Doris Xin] merge master
      50581fc [Doris Xin] added a TODO for logging in python
      46f6c8c [Doris Xin] fixed the NPE caused by closures being cleaned before being passed into the aggregate function
      7e1a481 [Doris Xin] changed the permission on SamplingUtil
      1d413ce [Doris Xin] fixed checkstyle issues
      9ee94ee [Doris Xin] [SPARK-2082] stratified sampling in PairRDDFunctions that guarantees exact sample size
      e3fd6a6 [Doris Xin] Merge branch 'master' into takeSample
      7cab53a [Doris Xin] fixed import bug in rdd.py
      ffea61a [Doris Xin] SPARK-1939: Refactor takeSample method in RDD
      1441977 [Doris Xin] SPARK-1939 Refactor takeSample method in RDD to use ScaSRS
      dc965364
    • Davies Liu's avatar
      [SPARK-2674] [SQL] [PySpark] support datetime type for SchemaRDD · f0d880e2
      Davies Liu authored
      Datetime and time in Python will be converted into java.util.Calendar after serialization, it will be converted into java.sql.Timestamp during inferSchema().
      
      In javaToPython(), Timestamp will be converted into Calendar, then be converted into datetime in Python after pickling.
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #1601 from davies/date and squashes the following commits:
      
      f0599b0 [Davies Liu] remove tests for sets and tuple in sql, fix list of list
      c9d607a [Davies Liu] convert datetype for runtime
      709d40d [Davies Liu] remove brackets
      96db384 [Davies Liu] support datetime type for SchemaRDD
      f0d880e2
    • Yin Huai's avatar
      [SPARK-2730][SQL] When retrieving a value from a Map, GetItem evaluates key twice · e3643485
      Yin Huai authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-2730
      
      Author: Yin Huai <huai@cse.ohio-state.edu>
      
      Closes #1637 from yhuai/SPARK-2730 and squashes the following commits:
      
      1a9f24e [Yin Huai] Remove unnecessary key evaluation.
      e3643485
Loading