  1. Dec 28, 2015
    • Josh Rosen's avatar
      [SPARK-12508][PROJECT-INFRA] Fix minor bugs in dev/tests/ script · ab6bedd8
      Josh Rosen authored
      This patch fixes a handful of minor bugs in the `dev/tests/` script, which is used by the `run_tests_jenkins` script to detect the addition of new public classes:
      - Account for differences between BSD and GNU `sed` in order to allow the script to run on OS X.
      - Diff `$ghprbActualCommit^...$ghprbActualCommit` instead of `master...$ghprbActualCommit`: since `ghprbActualCommit` is a merge commit which results from merging the PR into the target branch, this will give us the desired diff and will avoid certain race conditions which could lead to false positives.
      - Use `echo -e` instead of `echo` so that newline characters are handled correctly in output. This should fix a formatting glitch which caused the output to appear on a single line in the GitHub comment (see the SC2028 page on the ShellCheck wiki for more details).
      Author: Josh Rosen <>
      Closes #10455 from JoshRosen/fix-pr-public-classes-test.
    • Cheng Lian's avatar
      [SPARK-12218] Fixes ORC conjunction predicate push down · 8e23d8db
      Cheng Lian authored
      This PR is a follow-up of PR #10362.
      Two major changes:
      1.  The fix introduced in #10362 is OK for Parquet, but may disable ORC PPD in many cases.
          PR #10362 stops converting an `AND` predicate if any branch is inconvertible.  On the other hand, `OrcFilters` combines all filters into a single big conjunction first and then tries to convert it into an ORC `SearchArgument`.  This means that if any filter is inconvertible, no filters can be pushed down.  This PR fixes this issue by finding all convertible filters first, before doing the actual conversion.
          The current implementation is mostly a consequence of a limitation of the ORC `SearchArgument` builder, which this PR documents in detail.
      2.  Copied the `AND` predicate fix for ORC from #10362 to avoid merge conflicts.
      Same as #10362, this PR targets master (2.0.0-SNAPSHOT), branch-1.6, and branch-1.5.
      Author: Cheng Lian <>
      Closes #10377 from liancheng/spark-12218.fix-orc-conjunction-ppd.
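The "find the convertible filters first, then convert" idea from SPARK-12218 can be illustrated with a minimal Python sketch. The `Leaf` and `And` classes, the `convertible` flag, and both `pushdown_*` functions are hypothetical stand-ins invented for this illustration; they are not Spark's `OrcFilters` or the ORC `SearchArgument` API.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass(frozen=True)
class Leaf:
    """A single predicate; `convertible` is a hypothetical flag for this sketch."""
    name: str
    convertible: bool = True


@dataclass(frozen=True)
class And:
    left: "Filter"
    right: "Filter"


Filter = Union[Leaf, And]


def is_convertible(f: Filter) -> bool:
    # An AND is convertible only if both branches are (the rule from #10362).
    if isinstance(f, And):
        return is_convertible(f.left) and is_convertible(f.right)
    return f.convertible


def conjunction(filters: List[Filter]) -> Optional[Filter]:
    # Fold a list of filters into a left-nested AND, or None if the list is empty.
    if not filters:
        return None
    combined = filters[0]
    for f in filters[1:]:
        combined = And(combined, f)
    return combined


def pushdown_naive(filters: List[Filter]) -> Optional[Filter]:
    # Combine everything first: one inconvertible filter blocks all pushdown.
    combined = conjunction(filters)
    if combined is None or not is_convertible(combined):
        return None
    return combined


def pushdown_selective(filters: List[Filter]) -> Optional[Filter]:
    # Keep only the convertible filters, then build the conjunction from those.
    return conjunction([f for f in filters if is_convertible(f)])


filters = [Leaf("a > 1"), Leaf("udf(b)", convertible=False), Leaf("c = 3")]
naive = pushdown_naive(filters)          # nothing can be pushed down
selective = pushdown_selective(filters)  # "a > 1 AND c = 3" is still pushed down
```

In this sketch the naive strategy loses every filter because of the single inconvertible one, while the selective strategy still pushes down the remaining two.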
    • jerryshao's avatar
      [SPARK-12353][STREAMING][PYSPARK] Fix countByValue inconsistent output in Python API · 8d494009
      jerryshao authored
      The semantics of the Python `countByValue` differ from the Scala API: it behaves more like a count of distinct values. This change makes it consistent with the Scala/Java API.
      Author: jerryshao <>
      Closes #10350 from jerryshao/SPARK-12353.
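The difference between the two semantics in SPARK-12353 can be shown on a single in-memory batch. This is a plain-Python illustration with made-up data and hypothetical helper names, not the PySpark `DStream` implementation.

```python
from collections import Counter

# A hypothetical batch of values, standing in for one RDD of a stream.
batch = ["a", "b", "a", "c", "a"]


def count_distinct_values(items):
    # "countDistinctValue"-like behaviour: one number for the whole batch.
    return len(set(items))


def count_by_value(items):
    # countByValue semantics as in the Scala/Java API: (value, count) pairs.
    return sorted(Counter(items).items())


distinct = count_distinct_values(batch)
per_value = count_by_value(batch)
```

The first helper collapses the batch into a single count, while the second keeps one count per distinct value, which is the behaviour the commit message describes as consistent with Scala/Java.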
    • felixcheung's avatar
      [SPARK-12515][SQL][DOC] minor doc update for read.jdbc · 5aa2710c
      felixcheung authored
      Author: felixcheung <>
      Closes #10465 from felixcheung/dfreaderjdbcdoc.
    • gatorsmile's avatar
      [SPARK-12520] [PYSPARK] Correct Descriptions and Add Use Cases in Equi-Join · 9ab296ec
      gatorsmile authored
      After reading the JIRA, I double checked the code.
      For example, users can do the Equi-Join like
        ```df.join(df2, 'name', 'outer').select('name', 'height').collect()```
      - There is a bug in 1.5 and 1.4: the code ignores the third parameter (join type) that users pass, so the join performed is always `Inner`, even if the user specifies another type (e.g., `Outer`).
      - After a PR, 1.6 does not have this issue, but the description has not been updated.
      I plan to submit another PR to fix 1.5 and issue an error message if users specify a non-inner join type when using Equi-Join.
      Author: gatorsmile <>
      Closes #10477 from gatorsmile/pyOuterJoin.
  2. Dec 25, 2015
  3. Dec 24, 2015
    • pierre-borckmans's avatar
      [SPARK-12440][CORE] Avoid setCheckpoint warning when directory is not local · ea4aab7e
      pierre-borckmans authored
      In the SparkContext method `setCheckpointDir`, a warning is issued when the Spark master is not local and the directory passed for the checkpoint dir appears to be local.
      In practice, when relying on the HDFS configuration file and using a relative path for the checkpoint directory (using an incomplete URI without HDFS scheme, ...), this warning should not be issued and might be confusing.
      In fact, in this case, the checkpoint directory is successfully created, and the checkpointing mechanism works as expected.
      This PR uses the `FileSystem` instance created with the given directory, and checks whether it is local or not.
      (The rationale is that this same `FileSystem` instance is used to create the checkpoint dir anyway, so it can be reliably used to determine whether the directory is local.)
      The warning is only issued if the directory is not local, on top of the existing conditions.
      Author: pierre-borckmans <>
      Closes #10392 from pierre-borckmans/SPARK-12440_CheckpointDir_Warning_NonLocal.
    • CK50's avatar
      [SPARK-12010][SQL] Spark JDBC requires support for column-name-free INSERT syntax · 502476e4
      CK50 authored
      In the past Spark JDBC write only worked with technologies which support the following INSERT statement syntax (JdbcUtils.scala: insertStatement()):
      INSERT INTO $table VALUES ( ?, ?, ..., ? )
      But some technologies require a list of column names:
      INSERT INTO $table ( $colNameList ) VALUES ( ?, ?, ..., ? )
      This was blocking the use of e.g. the Progress JDBC Driver for Cassandra.
      Another limitation is that syntax 1 relies on the DataFrame field ordering matching that of the target table. This works fine, as long as the target table has been created by writer.jdbc().
      If the target table contains more columns (not created by writer.jdbc()), then the insert fails due to a mismatch in the number of columns or their data types.
      This PR switches to the recommended second INSERT syntax. Column names are taken from DataFrame field names.
      Author: CK50 <>
      Closes #10380 from CK50/master-SPARK-12010-2.
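The second INSERT syntax from SPARK-12010 can be sketched as a small string builder. The function below is a hypothetical Python illustration of the statement shape only; it is not the `insertStatement()` code in `JdbcUtils.scala`, and it deliberately does no identifier quoting or escaping.

```python
from typing import List


def insert_statement(table: str, columns: List[str]) -> str:
    # One "?" placeholder per column, listed in the same order as the names.
    # Note: no identifier quoting is done here; this only shows the shape.
    col_list = ", ".join(columns)
    placeholders = ", ".join("?" for _ in columns)
    return f"INSERT INTO {table} ({col_list}) VALUES ({placeholders})"


# Hypothetical table and field names, used only for the example.
stmt = insert_statement("people", ["name", "age"])
print(stmt)
```

Because the column names are spelled out, the statement no longer depends on the field order matching the column order of the target table.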
    • Kazuaki Ishizaki's avatar
      [SPARK-12311][CORE] Restore previous value of "os.arch" property in test... · 39204661
      Kazuaki Ishizaki authored
      [SPARK-12311][CORE] Restore previous value of "os.arch" property in test suites after forcing to set specific value to "os.arch" property
      Restore the original value of the os.arch property after each test.
      Because some tests force a specific value for the os.arch property, the original value must be restored afterwards.
      Author: Kazuaki Ishizaki <>
      Closes #10289 from kiszk/SPARK-12311.
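The save-and-restore pattern behind SPARK-12311 is sketched below in Python, using an environment variable as an analogue of a JVM system property. The helper name and the `DEMO_OS_ARCH` key are assumptions made for this illustration; the actual change concerns the `os.arch` property in Spark's Scala test suites.

```python
import os


def with_temporary_env(key, value, body):
    # Remember the previous value (or its absence) before overriding it.
    previous = os.environ.get(key)
    os.environ[key] = value
    try:
        return body()
    finally:
        # Always restore, even if the body raises.
        if previous is None:
            os.environ.pop(key, None)
        else:
            os.environ[key] = previous


# Hypothetical key standing in for the "os.arch" property.
os.environ["DEMO_OS_ARCH"] = "amd64"
seen = with_temporary_env("DEMO_OS_ARCH", "ppc64", lambda: os.environ["DEMO_OS_ARCH"])
after = os.environ["DEMO_OS_ARCH"]
```

Putting the restore in a `finally` block keeps a failing test from leaking the forced value into later tests.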
    • Kazuaki Ishizaki's avatar
      [SPARK-12502][BUILD][PYTHON] Script /dev/run-tests fails when IBM Java is used · 9e85bb71
      Kazuaki Ishizaki authored
      Fix an exception with IBM JDK by removing the update field from the JavaVersion tuple, because IBM JDK does not provide the update information ('_xx').
      Author: Kazuaki Ishizaki <>
      Closes #10463 from kiszk/SPARK-12502.
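A version parser that does not depend on the `_xx` update suffix, in the spirit of SPARK-12502, might look like the sketch below. The regular expression, the function name, and both sample version strings are assumptions for illustration; this is not the code in `dev/run-tests`.

```python
import re
from collections import namedtuple

# Only three fields: the update component is intentionally not part of the tuple.
JavaVersion = namedtuple("JavaVersion", ["major", "minor", "patch"])


def parse_java_version(version_output: str) -> JavaVersion:
    # Match "major.minor.patch" and ignore anything after it, such as "_66".
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", version_output)
    if match is None:
        raise ValueError(f"cannot find a Java version in: {version_output!r}")
    return JavaVersion(*(int(g) for g in match.groups()))


# Hypothetical sample strings: one with an update suffix, one without.
with_update = parse_java_version('java version "1.8.0_66"')
without_update = parse_java_version('java version "1.8.0"')
```

Both inputs parse to the same tuple, so a JDK that reports no update number no longer causes a failure.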
  4. Dec 23, 2015
    • Adrian Bridgett's avatar
      [SPARK-12499][BUILD] don't force MAVEN_OPTS · ead6abf7
      Adrian Bridgett authored
      Allow the user to override MAVEN_OPTS (2GB wasn't sufficient for me).
      Author: Adrian Bridgett <>
      Closes #10448 from abridgett/feature/do_not_force_maven_opts.
    • Sean Owen's avatar
      [SPARK-12500][CORE] Fix Tachyon deprecations; pull Tachyon dependency into one class · ae1f54aa
      Sean Owen authored
      Fix Tachyon deprecations; pull Tachyon dependency into `TachyonBlockManager` only
      CC calvinjia as I probably need a double-check that the usage of the new API is correct.
      Author: Sean Owen <>
      Closes #10449 from srowen/SPARK-12500.
    • pierre-borckmans's avatar
      [SPARK-12477][SQL] - Tungsten projection fails for null values in array fields · 43b2a639
      pierre-borckmans authored
      Accessing null elements in an array field fails when tungsten is enabled.
      It works in Spark 1.3.1, and in Spark > 1.5 with Tungsten disabled.
      This PR solves this by checking if the accessed element in the array field is null, in the generated code.
      // Array of String
      case class AS( as: Seq[String] )
      val dfAS = sc.parallelize( Seq( AS ( Seq("a",null,"b") ) ) ).toDF
      for (i <- 0 to 2) { println(i + " = " + sqlContext.sql(s"select as[$i] from T_AS").collect.mkString(","))}
      With Tungsten disabled:
      0 = [a]
      1 = [null]
      2 = [b]
      With Tungsten enabled:
      0 = [a]
      15/12/22 09:32:50 ERROR Executor: Exception in task 7.0 in stage 1.0 (TID 15)
      	at org.apache.spark.sql.catalyst.expressions.UnsafeRowWriters$UTF8StringWriter.getSize(
      	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
      	at org.apache.spark.sql.execution.TungstenProject$$anonfun$3$$anonfun$apply$3.apply(basicOperators.scala:90)
      	at org.apache.spark.sql.execution.TungstenProject$$anonfun$3$$anonfun$apply$3.apply(basicOperators.scala:88)
      	at scala.collection.Iterator$$anon$
      	at scala.collection.Iterator$$anon$
      	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
      	at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
      Author: pierre-borckmans <>
      Closes #10429 from pierre-borckmans/SPARK-12477_Tungsten-Projection-Null-Element-In-Array.
    • Liang-Chi Hsieh's avatar
      [SPARK-11164][SQL] Add InSet pushdown filter back for Parquet · 50301c0a
      Liang-Chi Hsieh authored
      When the filter is ```"b in ('1', '2')"```, the filter is not pushed down to Parquet. Thanks!
      Author: gatorsmile <>
      Author: xiaoli <>
      Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
      Closes #10278 from gatorsmile/parquetFilterNot.
  5. Dec 22, 2015
  6. Dec 21, 2015
    • Davies Liu's avatar
      [SPARK-12388] change default compression to lz4 · 29cecd4a
      Davies Liu authored
      According to the benchmark [1], LZ4-java could be 80% (or 30%) faster than Snappy.
      After changing the compressor to LZ4, I saw 20% improvement on end-to-end time for a TPCDS query (Q4).
      cc rxin
      Author: Davies Liu <>
      Closes #10342 from davies/lz4.
    • Andrew Or's avatar
      [SPARK-12466] Fix harmless NPE in tests · d655d37d
      Andrew Or authored
      [info] ReplayListenerSuite:
      [info] - Simple replay (58 milliseconds)
      	at org.apache.spark.deploy.master.Master$$anonfun$asyncRebuildSparkUI$1.applyOrElse(Master.scala:982)
      	at org.apache.spark.deploy.master.Master$$anonfun$asyncRebuildSparkUI$1.applyOrElse(Master.scala:980)
      This was introduced in #10284. It's harmless because the NPE is caused by a race that occurs mainly in `local-cluster` tests and does not actually fail them.
      Tested locally to verify that the NPE is gone.
      Author: Andrew Or <>
      Closes #10417 from andrewor14/fix-harmless-npe.
    • Reynold Xin's avatar
      [SPARK-2331] SparkContext.emptyRDD should return RDD[T] not EmptyRDD[T] · a820ca19
      Reynold Xin authored
      Author: Reynold Xin <>
      Closes #10394 from rxin/SPARK-2331.
    • Alex Bozarth's avatar
      [SPARK-12339][SPARK-11206][WEBUI] Added a null check that was removed in · b0849b8a
      Alex Bozarth authored
      Updates made in SPARK-11206 missed an edge case which causes a NullPointerException when a task is killed. In some cases when a task ends in failure, taskMetrics is initialized as null (see JobProgressListener.onTaskEnd()). To address this, a null check was added. Before the changes in SPARK-11206, this null check was performed at the start of the updateTaskAccumulatorValues() function.
      Author: Alex Bozarth <>
      Closes #10405 from ajbozarth/spark12339.
    • pshearer's avatar
      Doc typo: ltrim = trim from left end, not right · fc6dbcc7
      pshearer authored
      Author: pshearer <>
      Closes #10414 from pshearer/patch-1.
    • Takeshi YAMAMURO's avatar
      [SPARK-5882][GRAPHX] Add a test for GraphLoader.edgeListFile · 1eb90bc9
      Takeshi YAMAMURO authored
      Author: Takeshi YAMAMURO <>
      Closes #4674 from maropu/AddGraphLoaderSuite.
    • Takeshi YAMAMURO's avatar
      [SPARK-12392][CORE] Optimize a location order of broadcast blocks by... · 935f4663
      Takeshi YAMAMURO authored
      [SPARK-12392][CORE] Optimize a location order of broadcast blocks by considering preferred local hosts
      When multiple workers exist in a host, we can bypass unnecessary remote access for broadcasts; block managers fetch broadcast blocks from the same host instead of remote hosts.
      Author: Takeshi YAMAMURO <>
      Closes #10346 from maropu/OptimizeBlockLocationOrder.
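The ordering idea in SPARK-12392, preferring block locations on the same host, can be sketched as a stable sort. The `BlockManagerId` tuple below is a simplified stand-in and `order_locations` is a hypothetical helper; neither reflects Spark's actual classes or its full location-selection logic.

```python
from typing import List, NamedTuple


class BlockManagerId(NamedTuple):
    # Simplified stand-in with only the fields this sketch needs.
    executor_id: str
    host: str


def order_locations(locations: List[BlockManagerId], local_host: str) -> List[BlockManagerId]:
    # False sorts before True, so same-host locations come first.
    # sorted() is stable, so the original order is kept within each group.
    return sorted(locations, key=lambda loc: loc.host != local_host)


locs = [
    BlockManagerId("1", "host-b"),
    BlockManagerId("2", "host-a"),
    BlockManagerId("3", "host-c"),
    BlockManagerId("4", "host-a"),
]
ordered = order_locations(locs, "host-a")
```

A fetcher that tries locations in this order reaches block managers on its own host before any remote one.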
    • gatorsmile's avatar
      [SPARK-12374][SPARK-12150][SQL] Adding logical/physical operators for Range · 4883a508
      gatorsmile authored
      Based on suggestions from marmbrus, added logical/physical operators for Range to improve performance.
      Also added another API to resolve the JIRA SPARK-12150.
      Could you take a look at my implementation, marmbrus? If it is not good, I can rework it. : )
      Thank you very much!
      Author: gatorsmile <>
      Closes #10335 from gatorsmile/rangeOperators.
    • Wenchen Fan's avatar
      [SPARK-12321][SQL] JSON format for TreeNode (use reflection) · 7634fe95
      Wenchen Fan authored
      An alternative solution: instead of implementing a JSON format for all logical/physical plans and expressions, use reflection to implement it in `TreeNode`.
      Here I use pre-order traversal to flatten a plan tree to a plan list, and add an extra field `num-children` to each plan node, so that we can reconstruct the tree from the list.
      example json:
      logical plan tree:
      [ {
        "class" : "org.apache.spark.sql.catalyst.plans.logical.Sort",
        "num-children" : 1,
        "order" : [ [ {
          "class" : "org.apache.spark.sql.catalyst.expressions.SortOrder",
          "num-children" : 1,
          "child" : 0,
          "direction" : "Ascending"
        }, {
          "class" : "org.apache.spark.sql.catalyst.expressions.AttributeReference",
          "num-children" : 0,
          "name" : "i",
          "dataType" : "integer",
          "nullable" : true,
          "metadata" : { },
          "exprId" : {
            "id" : 10,
            "jvmId" : "cd1313c7-3f66-4ed7-a320-7d91e4633ac6"
          },
          "qualifiers" : [ ]
        } ] ],
        "global" : false,
        "child" : 0
      }, {
        "class" : "org.apache.spark.sql.catalyst.plans.logical.Project",
        "num-children" : 1,
        "projectList" : [ [ {
          "class" : "org.apache.spark.sql.catalyst.expressions.Alias",
          "num-children" : 1,
          "child" : 0,
          "name" : "i",
          "exprId" : {
            "id" : 10,
            "jvmId" : "cd1313c7-3f66-4ed7-a320-7d91e4633ac6"
          },
          "qualifiers" : [ ]
        }, {
          "class" : "org.apache.spark.sql.catalyst.expressions.Add",
          "num-children" : 2,
          "left" : 0,
          "right" : 1
        }, {
          "class" : "org.apache.spark.sql.catalyst.expressions.AttributeReference",
          "num-children" : 0,
          "name" : "a",
          "dataType" : "integer",
          "nullable" : true,
          "metadata" : { },
          "exprId" : {
            "id" : 0,
            "jvmId" : "cd1313c7-3f66-4ed7-a320-7d91e4633ac6"
          },
          "qualifiers" : [ ]
        }, {
          "class" : "org.apache.spark.sql.catalyst.expressions.Literal",
          "num-children" : 0,
          "value" : "1",
          "dataType" : "integer"
        } ], [ {
          "class" : "org.apache.spark.sql.catalyst.expressions.Alias",
          "num-children" : 1,
          "child" : 0,
          "name" : "j",
          "exprId" : {
            "id" : 11,
            "jvmId" : "cd1313c7-3f66-4ed7-a320-7d91e4633ac6"
          },
          "qualifiers" : [ ]
        }, {
          "class" : "org.apache.spark.sql.catalyst.expressions.Multiply",
          "num-children" : 2,
          "left" : 0,
          "right" : 1
        }, {
          "class" : "org.apache.spark.sql.catalyst.expressions.AttributeReference",
          "num-children" : 0,
          "name" : "a",
          "dataType" : "integer",
          "nullable" : true,
          "metadata" : { },
          "exprId" : {
            "id" : 0,
            "jvmId" : "cd1313c7-3f66-4ed7-a320-7d91e4633ac6"
          },
          "qualifiers" : [ ]
        }, {
          "class" : "org.apache.spark.sql.catalyst.expressions.Literal",
          "num-children" : 0,
          "value" : "2",
          "dataType" : "integer"
        } ] ],
        "child" : 0
      }, {
        "class" : "org.apache.spark.sql.catalyst.plans.logical.LocalRelation",
        "num-children" : 0,
        "output" : [ [ {
          "class" : "org.apache.spark.sql.catalyst.expressions.AttributeReference",
          "num-children" : 0,
          "name" : "a",
          "dataType" : "integer",
          "nullable" : true,
          "metadata" : { },
          "exprId" : {
            "id" : 0,
            "jvmId" : "cd1313c7-3f66-4ed7-a320-7d91e4633ac6"
          },
          "qualifiers" : [ ]
        } ] ],
        "data" : [ ]
      } ]
      Author: Wenchen Fan <>
      Closes #10311 from cloud-fan/toJson-reflection.
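The flatten-and-rebuild scheme described for SPARK-12321 (pre-order list plus a `num-children` field) can be sketched in Python. The `Node` class and both functions are hypothetical and only model the tree shape; the real change uses reflection in Scala's `TreeNode` and serializes many more fields.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Node:
    name: str
    children: List["Node"] = field(default_factory=list)


def flatten(node: Node) -> List[Dict]:
    # Pre-order: emit the node itself, then each child subtree in order.
    out = [{"class": node.name, "num-children": len(node.children)}]
    for child in node.children:
        out.extend(flatten(child))
    return out


def rebuild(records: List[Dict]) -> Node:
    def build(pos: int) -> Tuple[Node, int]:
        # Read one record, then consume exactly `num-children` subtrees after it.
        rec = records[pos]
        node = Node(rec["class"])
        pos += 1
        for _ in range(rec["num-children"]):
            child, pos = build(pos)
            node.children.append(child)
        return node, pos

    root, consumed = build(0)
    if consumed != len(records):
        raise ValueError("trailing records after the root subtree")
    return root


tree = Node("Add", [Node("AttributeReference"), Node("Literal")])
flat = flatten(tree)
restored = rebuild(flat)
```

The `num-children` count is what makes the flat list unambiguous: without it, a pre-order sequence alone could correspond to several different tree shapes.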
    • Dilip Biswal's avatar
      [SPARK-12398] Smart truncation of DataFrame / Dataset toString · 474eb21a
      Dilip Biswal authored
      When a DataFrame or Dataset has a long schema, we should intelligently truncate to avoid flooding the screen with unreadable information.
      // Standard output
      [a: int, b: int]
      // Truncate many top level fields
      [a: int, b: string ... 10 more fields]
      // Truncate long inner structs
      [a: struct<a: Int ... 10 more fields>]
      Author: Dilip Biswal <>
      Closes #10373 from dilipbiswal/spark-12398.
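The top-level truncation shown in the SPARK-12398 examples can be approximated with a short helper. The function name and the `max_fields` parameter are assumptions for this sketch, which only mirrors the output format quoted in the commit message and does not handle nested structs.

```python
from typing import List, Tuple


def truncated_schema_string(fields: List[Tuple[str, str]], max_fields: int = 2) -> str:
    # Render at most `max_fields` "name: type" pairs, then summarize the rest.
    shown = [f"{name}: {dtype}" for name, dtype in fields[:max_fields]]
    remaining = len(fields) - max_fields
    if remaining > 0:
        return "[" + ", ".join(shown) + f" ... {remaining} more fields]"
    return "[" + ", ".join(shown) + "]"


short = truncated_schema_string([("a", "int"), ("b", "int")])

# Twelve hypothetical fields: two are shown, ten are summarized.
many = [("a", "int"), ("b", "string")] + [(f"c{i}", "int") for i in range(10)]
long = truncated_schema_string(many)
```

A short schema is printed in full, while a long one is cut after the first fields with a count of what was omitted.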
    • Jeff Zhang's avatar
      [PYSPARK] Pyspark typo & Add missing abstractmethod annotation · 1920d72a
      Jeff Zhang authored
      No JIRA was created since this is a trivial change.
      davies, please help review it.
      Author: Jeff Zhang <>
      Closes #10143 from zjffdu/pyspark_typo.