  1. Jan 25, 2017
    • Holden Karau's avatar
      [SPARK-19064][PYSPARK] Fix pip installing of sub components · a5c10ff2
      Holden Karau authored
      
      ## What changes were proposed in this pull request?
      
      Fix installation of the mllib and ml sub-components, and more eagerly clean up cache files during the test script and make-distribution.
      
      ## How was this patch tested?
      
      Updated sanity test script to import mllib and ml sub-components.
      
      Author: Holden Karau <holden@us.ibm.com>
      
      Closes #16465 from holdenk/SPARK-19064-fix-pip-install-sub-components.
      
      (cherry picked from commit 965c82d8)
      Signed-off-by: Holden Karau <holden@us.ibm.com>
      a5c10ff2
    • Marcelo Vanzin's avatar
      [SPARK-18750][YARN] Follow up: move test to correct directory in 2.1 branch. · 97d3353e
      Marcelo Vanzin authored
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #16704 from vanzin/SPARK-18750_2.1.
      97d3353e
    • Marcelo Vanzin's avatar
      [SPARK-19307][PYSPARK] Make sure user conf is propagated to SparkContext. · c9f075ab
      Marcelo Vanzin authored
      
      The code was failing to propagate the user conf in the case where the
      JVM was already initialized, which happens when a user submits a
      python script via spark-submit.
      
      Tested with new unit test and by running a python script in a real cluster.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #16682 from vanzin/SPARK-19307.
      
      (cherry picked from commit 92afaa93)
      Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
      c9f075ab
    • Nattavut Sutyanyong's avatar
      [SPARK-18863][SQL] Output non-aggregate expressions without GROUP BY in a... · af954553
      Nattavut Sutyanyong authored
      [SPARK-18863][SQL] Output non-aggregate expressions without GROUP BY in a subquery does not yield an error
      
      ## What changes were proposed in this pull request?
      This PR reports proper error messages when a subquery expression contains an invalid plan. The problem is fixed by calling CheckAnalysis on the plan inside a subquery.
      
      ## How was this patch tested?
      Existing tests and two new test cases on 2 forms of subquery, namely, scalar subquery and in/exists subquery.
      
      ````
      -- TC 01.01
      -- The column t2b in the SELECT of the subquery is invalid
      -- because it is neither an aggregate function nor a GROUP BY column.
      select t1a, t2b
      from   t1, t2
      where  t1b = t2c
      and    t2b = (select max(avg)
                    from   (select   t2b, avg(t2b) avg
                            from     t2
                            where    t2a = t1.t1b
                           )
                   )
      ;
      
      -- TC 01.02
      -- Invalid due to the column t2b not part of the output from table t2.
      select *
      from   t1
      where  t1a in (select   min(t2a)
                     from     t2
                     group by t2c
                     having   t2c in (select   max(t3c)
                                      from     t3
                                      group by t3b
                                      having   t3b > t2b ))
      ;
      ````
      
      Author: Nattavut Sutyanyong <nsy.can@gmail.com>
      
      Closes #16572 from nsyca/18863.
      
      (cherry picked from commit f1ddca5f)
      Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
      af954553
    • Marcelo Vanzin's avatar
      [SPARK-18750][YARN] Avoid using "mapValues" when allocating containers. · f391ad2c
      Marcelo Vanzin authored
      
      That method is prone to stack overflows when the input map is really
      large; instead, use plain "map". Also includes a unit test that was
      tested and caused stack overflows without the fix.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #16667 from vanzin/SPARK-18750.
      
      (cherry picked from commit 76db394f)
      Signed-off-by: Tom Graves <tgraves@yahoo-inc.com>
      f391ad2c
    • aokolnychyi's avatar
      [SPARK-16046][DOCS] Aggregations in the Spark SQL programming guide · e2f77392
      aokolnychyi authored
      ## What changes were proposed in this pull request?
      
      - A separate subsection for Aggregations under “Getting Started” in the Spark SQL programming guide. It mentions which aggregate functions are predefined and how users can create their own.
      - Examples of using the `UserDefinedAggregateFunction` abstract class for untyped aggregations in Java and Scala.
      - Examples of using the `Aggregator` abstract class for type-safe aggregations in Java and Scala.
      - Python is not covered.
      - The PR might not resolve the ticket since I do not know what exactly was planned by the author.
      
      In total, there are four new standalone examples that can be executed via `spark-submit` or `run-example`. The updated Spark SQL programming guide references these examples and does not contain hard-coded snippets.
      
      ## How was this patch tested?
      
      The patch was tested locally by building the docs. The examples were run as well.
      
      ![image](https://cloud.githubusercontent.com/assets/6235869/21292915/04d9d084-c515-11e6-811a-999d598dffba.png)
      
      Author: aokolnychyi <okolnychyyanton@gmail.com>
      
      Closes #16329 from aokolnychyi/SPARK-16046.
      
      (cherry picked from commit 3fdce814)
      Signed-off-by: gatorsmile <gatorsmile@gmail.com>
      e2f77392
  2. Jan 24, 2017
    • Liwei Lin's avatar
      [SPARK-19330][DSTREAMS] Also show tooltip for successful batches · c1337879
      Liwei Lin authored
      ## What changes were proposed in this pull request?
      
      ### Before
      ![_streaming_before](https://cloud.githubusercontent.com/assets/15843379/22181462/1e45c20c-e0c8-11e6-831c-8bf69722a4ee.png)
      
      ### After
      ![_streaming_after](https://cloud.githubusercontent.com/assets/15843379/22181464/23f38a40-e0c8-11e6-9a87-e27b1ffb1935.png)
      
      ## How was this patch tested?
      
      Manually
      
      Author: Liwei Lin <lwlin7@gmail.com>
      
      Closes #16673 from lw-lin/streaming.
      
      (cherry picked from commit 40a4cfc7)
      Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
      c1337879
    • Nattavut Sutyanyong's avatar
      [SPARK-19017][SQL] NOT IN subquery with more than one column may return incorrect results · b94fb284
      Nattavut Sutyanyong authored
      
      ## What changes were proposed in this pull request?
      
      This PR fixes the code in the Optimizer phase where the NULL-aware expression of a NOT IN query is expanded in the rule `RewritePredicateSubquery`.
      
      Example:
      The query
      
       select a1,b1
       from   t1
       where  (a1,b1) not in (select a2,b2
                              from   t2);
      
      has the (a1, b1) = (a2, b2) rewritten from (before this fix):
      
      Join LeftAnti, ((isnull((_1#2 = a2#16)) || isnull((_2#3 = b2#17))) || ((_1#2 = a2#16) && (_2#3 = b2#17)))
      
      to (after this fix):
      
      Join LeftAnti, (((_1#2 = a2#16) || isnull((_1#2 = a2#16))) && ((_2#3 = b2#17) || isnull((_2#3 = b2#17))))
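
To see why the two join conditions differ, here is a minimal Python sketch of SQL's three-valued equality. It is an illustration only, not Spark code: the helper names (`eq`, `matches_before`, `matches_after`, `not_in`) and the sample rows are hypothetical, and `None` stands in for SQL NULL. In a left anti join, a row of `t1` is dropped when the condition holds for any row of `t2`.

```python
def eq(x, y):
    """SQL-style equality: NULL (None) if either side is NULL."""
    return None if x is None or y is None else x == y


def matches_before(a1, b1, a2, b2):
    # Condition before the fix: any NULL comparison counts as a match,
    # even when the other column is definitely different.
    e1, e2 = eq(a1, a2), eq(b1, b2)
    return e1 is None or e2 is None or (e1 and e2)


def matches_after(a1, b1, a2, b2):
    # Condition after the fix: every column must be equal or unknown.
    e1, e2 = eq(a1, a2), eq(b1, b2)
    return (e1 is None or e1) and (e2 is None or e2)


def not_in(t1, t2, matches):
    """Left anti join: keep the rows of t1 that match no row of t2."""
    return [r for r in t1 if not any(matches(*r, *s) for s in t2)]


t1 = [(None, 1), (2, 3)]
t2 = [(2, 3), (4, 5)]
rows_before = not_in(t1, t2, matches_before)
rows_after = not_in(t1, t2, matches_after)
```

With these sample rows, `(None, 1)` cannot equal any row of `t2` because its second column differs from both, so it belongs in the NOT IN result. Only the condition after the fix keeps it.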
      
      ## How was this patch tested?
      
      sql/test, catalyst/test and new test cases in SQLQueryTestSuite.
      
      Author: Nattavut Sutyanyong <nsy.can@gmail.com>
      
      Closes #16467 from nsyca/19017.
      
      (cherry picked from commit cdb691eb)
      Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
      b94fb284
    • Ilya Matiach's avatar
      [SPARK-16473][MLLIB] Fix BisectingKMeans Algorithm failing in edge case · d128b6a3
      Ilya Matiach authored
      [SPARK-16473][MLLIB] Fix BisectingKMeans Algorithm failing in edge case where no children exist in updateAssignments
      
      ## What changes were proposed in this pull request?
      
      Fix a bug in which BisectingKMeans fails with error:
      java.util.NoSuchElementException: key not found: 166
              at scala.collection.MapLike$class.default(MapLike.scala:228)
              at scala.collection.AbstractMap.default(Map.scala:58)
              at scala.collection.MapLike$class.apply(MapLike.scala:141)
              at scala.collection.AbstractMap.apply(Map.scala:58)
              at org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1$$anonfun$2.apply$mcDJ$sp(BisectingKMeans.scala:338)
              at org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1$$anonfun$2.apply(BisectingKMeans.scala:337)
              at org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1$$anonfun$2.apply(BisectingKMeans.scala:337)
              at scala.collection.TraversableOnce$$anonfun$minBy$1.apply(TraversableOnce.scala:231)
              at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
              at scala.collection.immutable.List.foldLeft(List.scala:84)
              at scala.collection.LinearSeqOptimized$class.reduceLeft(LinearSeqOptimized.scala:125)
              at scala.collection.immutable.List.reduceLeft(List.scala:84)
              at scala.collection.TraversableOnce$class.minBy(TraversableOnce.scala:231)
              at scala.collection.AbstractTraversable.minBy(Traversable.scala:105)
              at org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1.apply(BisectingKMeans.scala:337)
              at org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1.apply(BisectingKMeans.scala:334)
              at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
              at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:389)
      
      ## How was this patch tested?
      
      The dataset was run against the code change to verify that the code works.  I will try to add unit tests to the code.
      
      Author: Ilya Matiach <ilmat@microsoft.com>
      
      Closes #16355 from imatiach-msft/ilmat/fix-kmeans.
      d128b6a3
    • Felix Cheung's avatar
      [SPARK-18823][SPARKR] add support for assigning to column · 9c04e427
      Felix Cheung authored
      
      ## What changes were proposed in this pull request?
      
      Support for
      ```
      df[[myname]] <- 1
      df[[2]] <- df$eruptions
      ```
      
      ## How was this patch tested?
      
      manual tests, unit tests
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #16663 from felixcheung/rcolset.
      
      (cherry picked from commit f27e0247)
      Signed-off-by: Felix Cheung <felixcheung@apache.org>
      9c04e427
    • Shixiong Zhu's avatar
      [SPARK-19268][SS] Disallow adaptive query execution for streaming queries · 570e5e11
      Shixiong Zhu authored
      
      ## What changes were proposed in this pull request?
      
      As adaptive query execution may change the number of partitions in different batches, it may break streaming queries. Hence, we should disallow this feature in Structured Streaming.
      
      ## How was this patch tested?
      
      `test("SPARK-19268: Adaptive query execution should be disallowed")`.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #16683 from zsxwing/SPARK-19268.
      
      (cherry picked from commit 60bd91a3)
      Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
      570e5e11
    • hyukjinkwon's avatar
      [SPARK-9435][SQL] Reuse function in Java UDF to correctly support expressions... · 4a2be090
      hyukjinkwon authored
      [SPARK-9435][SQL] Reuse function in Java UDF to correctly support expressions that require equality comparison between ScalaUDF
      
      ## What changes were proposed in this pull request?
      
      Currently, running the following code in Java
      
      ```java
      spark.udf().register("inc", new UDF1<Long, Long>() {
        @Override
        public Long call(Long i) {
          return i + 1;
        }
      }, DataTypes.LongType);
      
      spark.range(10).toDF("x").createOrReplaceTempView("tmp");
      Row result = spark.sql("SELECT inc(x) FROM tmp GROUP BY inc(x)").head();
      Assert.assertEquals(7, result.getLong(0));
      ```
      
      fails as below:
      
      ```
      org.apache.spark.sql.AnalysisException: expression 'tmp.`x`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;;
      Aggregate [UDF(x#19L)], [UDF(x#19L) AS UDF(x)#23L]
      +- SubqueryAlias tmp, `tmp`
         +- Project [id#16L AS x#19L]
            +- Range (0, 10, step=1, splits=Some(8))
      
      	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40)
      	at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:57)
      ```
      
      The root cause is that a new function object was created every time it was needed, as shown below:
      
      ```scala
      scala> def inc(i: Int) = i + 1
      inc: (i: Int)Int
      
      scala> (inc(_: Int)).hashCode
      res15: Int = 1231799381
      
      scala> (inc(_: Int)).hashCode
      res16: Int = 2109839984
      
      scala> (inc(_: Int)) == (inc(_: Int))
      res17: Boolean = false
      ```
      
      This seems to lead to comparison failures between `ScalaUDF`s created from the Java UDF API, for example in `Expression.semanticEquals`.
      
      The Scala API already seems to work fine.
      
      Both can be tested easily as below if any reviewer is more comfortable with Scala:
      
      ```scala
      val df = Seq((1, 10), (2, 11), (3, 12)).toDF("x", "y")
      val javaUDF = new UDF1[Int, Int]  {
        override def call(i: Int): Int = i + 1
      }
      // spark.udf.register("inc", javaUDF, IntegerType) // Uncomment this for Java API
      // spark.udf.register("inc", (i: Int) => i + 1)    // Uncomment this for Scala API
      df.createOrReplaceTempView("tmp")
      spark.sql("SELECT inc(y) FROM tmp GROUP BY inc(y)").show()
      ```
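
The identity problem described above is not specific to Scala. As a rough analogy only (this is plain Python, not Spark, and the names `inc` and `wrap` are hypothetical), wrapping the same function twice produces two objects that behave identically but never compare equal, whereas reusing a single wrapper compares equal to itself:

```python
def inc(i):
    return i + 1


def wrap():
    # Each call builds a brand-new function object around inc.
    return lambda i: inc(i)


fresh_a, fresh_b = wrap(), wrap()
reused = wrap()

# Same behavior, but distinct function objects never compare equal.
same_behavior = fresh_a(1) == fresh_b(1)
fresh_equal = fresh_a == fresh_b
reused_equal = reused == reused
```

Any equality check that depends on the wrapped function therefore only succeeds when the wrapper is created once and reused.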
      
      ## How was this patch tested?
      
      Unit test in `JavaUDFSuite.java` and `./dev/lint-java`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #16553 from HyukjinKwon/SPARK-9435.
      
      (cherry picked from commit e576c1ed)
      Signed-off-by: gatorsmile <gatorsmile@gmail.com>
      4a2be090
  3. Jan 23, 2017
    • jerryshao's avatar
      [SPARK-19306][CORE] Fix inconsistent state in DiskBlockObject when exception occurred · ed5d1e72
      jerryshao authored
      
      ## What changes were proposed in this pull request?
      
      In `DiskBlockObjectWriter`, when an error occurs during writing, `revertPartialWritesAndClose` is called. If this method fails in turn, for example because the disk is full, it throws an exception without resetting the state of the writer and skips the revert. This PR fixes the issue so that the user has a chance to recover from such a failure.
      
      ## How was this patch tested?
      
      Existing test.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #16657 from jerryshao/SPARK-19306.
      
      (cherry picked from commit e4974721)
      Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
      ed5d1e72
    • actuaryzhang's avatar
      [SPARK-19155][ML] Make family case insensitive in GLM · 1e07a719
      actuaryzhang authored
      
      ## What changes were proposed in this pull request?
      This is a supplement to PR #16516 which did not make the value from `getFamily` case insensitive. Current tests of poisson/binomial glm with weight fail when specifying 'Poisson' or 'Binomial', because the calculation of `dispersion` and `pValue` checks the value of family retrieved from `getFamily`
      ```
      model.getFamily == Binomial.name || model.getFamily == Poisson.name
      ```
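
A minimal sketch of the idea in Python, for illustration only: the function name `is_binomial_or_poisson` is hypothetical and this is not the Spark implementation. Normalizing the family name before comparing makes the check independent of how the user capitalized it.

```python
def is_binomial_or_poisson(family):
    # Compare on a lower-cased copy so 'Poisson' and 'poisson' agree.
    return family.lower() in ("binomial", "poisson")


# A case-sensitive comparison misses the capitalized spelling.
naive_check = "Poisson" == "poisson"
normalized_check = is_binomial_or_poisson("Poisson")
```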
      
      ## How was this patch tested?
      Update existing tests for 'Poisson' and 'Binomial'.
      
      yanboliang felixcheung imatiach-msft
      
      Author: actuaryzhang <actuaryzhang10@gmail.com>
      
      Closes #16675 from actuaryzhang/family.
      
      (cherry picked from commit f067acef)
      Signed-off-by: Yanbo Liang <ybliang8@gmail.com>
      1e07a719
  4. Jan 21, 2017
  5. Jan 20, 2017
  6. Jan 19, 2017
    • Wenchen Fan's avatar
      [SPARK-18899][SPARK-18912][SPARK-18913][SQL] refactor the error checking when... · 7bc3e9ba
      Wenchen Fan authored
      [SPARK-18899][SPARK-18912][SPARK-18913][SQL] refactor the error checking when append data to an existing table
      
      ## What changes were proposed in this pull request?
      
      When we append data to an existing table with `DataFrameWriter.saveAsTable`, we will do various checks to make sure the appended data is consistent with the existing data.
      
      However, we get the information of the existing table by matching the table relation, instead of looking at the table metadata. This is error-prone, e.g. we only check the number of columns for `HadoopFsRelation`, we forget to check bucketing, etc.
      
      This PR refactors the error checking by looking at the metadata of the existing table, and fixes several bugs:
      * SPARK-18899: We forgot to check whether the specified bucketing matches the existing table, which may lead to a problematic table that has different bucketing in different data files.
      * SPARK-18912: We forgot to check the number of columns for non-file-based data source tables.
      * SPARK-18913: We don't support appending data to a table with special column names.
      
      ## How was this patch tested?
      new regression test.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #16313 from cloud-fan/bug1.
      
      (cherry picked from commit f923c849)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
      7bc3e9ba
  7. Jan 18, 2017
    • Liwei Lin's avatar
      [SPARK-19168][STRUCTURED STREAMING] StateStore should be aborted upon error · 4cff0b50
      Liwei Lin authored
      
      ## What changes were proposed in this pull request?
      
      We should call `StateStore.abort()` when any error occurs before the store is committed.
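
The commit-or-abort pattern can be sketched in Python as follows. This is a toy model under stated assumptions, not the Spark `StateStore` API: the class `ToyStateStore`, its methods, and the helper `update_store` are hypothetical names invented for the example.

```python
class ToyStateStore:
    """Toy store: updates are staged until commit() or abort()."""

    def __init__(self):
        self.committed = {}
        self.pending = {}
        self.aborted = False

    def put(self, key, value):
        self.pending[key] = value

    def commit(self):
        self.committed.update(self.pending)
        self.pending = {}

    def abort(self):
        # Discard staged updates so no partial state is left behind.
        self.pending = {}
        self.aborted = True


def update_store(store, updates):
    try:
        for key, value in updates:
            if value is None:
                raise ValueError(f"no value for key {key!r}")
            store.put(key, value)
        store.commit()
    except Exception:
        # Any failure before the commit aborts the store, then re-raises.
        store.abort()
        raise
```

Re-raising after `abort()` keeps the original error visible to the caller while still releasing the staged updates.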
      
      ## How was this patch tested?
      
      Manually.
      
      Author: Liwei Lin <lwlin7@gmail.com>
      
      Closes #16547 from lw-lin/append-filter.
      
      (cherry picked from commit 569e5068)
      Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
      4cff0b50
    • Shixiong Zhu's avatar
      [SPARK-19113][SS][TESTS] Ignore StreamingQueryException thrown from... · 047506ba
      Shixiong Zhu authored
      [SPARK-19113][SS][TESTS] Ignore StreamingQueryException thrown from awaitInitialization to avoid breaking tests
      
      ## What changes were proposed in this pull request?
      
      #16492 missed one race condition: `StreamExecution.awaitInitialization` may throw fatal errors and fail the test. This PR just ignores `StreamingQueryException` thrown from `awaitInitialization` so that we can verify the exception in the `ExpectFailure` action later. It's fine since `StopStream` or `ExpectFailure` will catch `StreamingQueryException` as well.
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #16567 from zsxwing/SPARK-19113-2.
      
      (cherry picked from commit c050c122)
      Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
      047506ba
    • Felix Cheung's avatar
      [SPARK-19231][SPARKR] add error handling for download and untar for Spark release · 77202a6c
      Felix Cheung authored
      
      ## What changes were proposed in this pull request?
      
      When the R package starts and needs to download the Spark release distribution, we need to handle errors from the download and untar steps and clean up; otherwise it will get stuck.
      
      ## How was this patch tested?
      
      manually
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #16589 from felixcheung/rtarreturncode.
      
      (cherry picked from commit 278fa1eb)
      Signed-off-by: Felix Cheung <felixcheung@apache.org>
      77202a6c
  8. Jan 17, 2017
    • wm624@hotmail.com's avatar
      [SPARK-19066][SPARKR][BACKPORT-2.1] LDA doesn't set optimizer correctly · 29b954bb
      wm624@hotmail.com authored
      ## What changes were proposed in this pull request?
      Back port the fix to SPARK-19066 to 2.1 branch.
      
      ## How was this patch tested?
      Unit tests
      
      Author: wm624@hotmail.com <wm624@hotmail.com>
      
      Closes #16623 from wangmiao1981/bugport.
      29b954bb
    • gatorsmile's avatar
      [SPARK-19129][SQL] SessionCatalog: Disallow empty part col values in partition spec · 3ec3e3f2
      gatorsmile authored
      
      Empty partition column values are not valid in a partition specification. Before this PR, we allowed users to specify them, and the Hive metastore does not detect or disallow them either. Thus, users hit the following strange error.
      
      ```Scala
      val df = spark.createDataFrame(Seq((0, "a"), (1, "b"))).toDF("partCol1", "name")
      df.write.mode("overwrite").partitionBy("partCol1").saveAsTable("partitionedTable")
      spark.sql("alter table partitionedTable drop partition(partCol1='')")
      spark.table("partitionedTable").show()
      ```
      
      In the above example, the WHOLE table is DROPPED when users specify a partition spec containing only one partition column with empty values.
      
      When there is more than one partition column, the Hive metastore APIs simply ignore the columns with empty values and treat the spec as a partial spec. This is also unexpected and does not follow actual Hive behavior. This PR disallows users from specifying such an invalid partition spec in the `SessionCatalog` APIs.
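
The kind of check described above can be sketched in Python. This is an illustration only, not the `SessionCatalog` code: the function name `require_non_empty_partition_values` and the representation of a partition spec as a plain dict are assumptions made for the example.

```python
def require_non_empty_partition_values(spec):
    """Reject a partition spec (column -> value) with empty values."""
    empty = [col for col, value in spec.items() if value is None or value == ""]
    if empty:
        raise ValueError(
            f"Partition spec is invalid; empty values for columns: {empty}"
        )
    return spec


try:
    require_non_empty_partition_values({"partCol1": ""})
    rejected = False
except ValueError:
    rejected = True
```

Failing early means a spec such as `partCol1=''` never reaches the metastore, where it could be silently treated as a partial spec.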
      
      Added test cases
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #16583 from gatorsmile/disallowEmptyPartColValue.
      
      (cherry picked from commit a23debd7)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
      3ec3e3f2
    • Shixiong Zhu's avatar
      [SPARK-19065][SQL] Don't inherit expression id in dropDuplicates · 13986a72
      Shixiong Zhu authored
      
      ## What changes were proposed in this pull request?
      
      `dropDuplicates` will create an Alias using the same exprId, so `StreamExecution` should also replace Alias if necessary.
      
      ## How was this patch tested?
      
      test("SPARK-19065: dropDuplicates should not create expressions using the same id")
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #16564 from zsxwing/SPARK-19065.
      
      (cherry picked from commit a83accfc)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
      13986a72
    • hyukjinkwon's avatar
      [SPARK-19019] [PYTHON] Fix hijacked `collections.namedtuple` and port... · 2ff36691
      hyukjinkwon authored
      [SPARK-19019] [PYTHON] Fix hijacked `collections.namedtuple` and port cloudpickle changes for PySpark to work with Python 3.6.0
      
      ## What changes were proposed in this pull request?
      
      Currently, PySpark does not work with Python 3.6.0.
      
      Running `./bin/pyspark` simply throws the error as below and PySpark does not work at all:
      
      ```
      Traceback (most recent call last):
        File ".../spark/python/pyspark/shell.py", line 30, in <module>
          import pyspark
        File ".../spark/python/pyspark/__init__.py", line 46, in <module>
          from pyspark.context import SparkContext
        File ".../spark/python/pyspark/context.py", line 36, in <module>
          from pyspark.java_gateway import launch_gateway
        File ".../spark/python/pyspark/java_gateway.py", line 31, in <module>
          from py4j.java_gateway import java_import, JavaGateway, GatewayClient
        File "<frozen importlib._bootstrap>", line 961, in _find_and_load
        File "<frozen importlib._bootstrap>", line 950, in _find_and_load_unlocked
        File "<frozen importlib._bootstrap>", line 646, in _load_unlocked
        File "<frozen importlib._bootstrap>", line 616, in _load_backward_compatible
        File ".../spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 18, in <module>
        File "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pydoc.py", line 62, in <module>
          import pkgutil
        File "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pkgutil.py", line 22, in <module>
          ModuleInfo = namedtuple('ModuleInfo', 'module_finder name ispkg')
        File ".../spark/python/pyspark/serializers.py", line 394, in namedtuple
          cls = _old_namedtuple(*args, **kwargs)
      TypeError: namedtuple() missing 3 required keyword-only arguments: 'verbose', 'rename', and 'module'
      ```
      
      The root cause seems to be that some arguments of `namedtuple` are keyword-only arguments as of Python 3.6.0 (see https://bugs.python.org/issue25628).

      We currently copy this function via `types.FunctionType`, which does not set the default values of keyword-only arguments (that is, `namedtuple.__kwdefaults__`), and this seems to leave those arguments without values in the copied function (unbound arguments).

      This PR proposes to work around this by setting them manually via `kwargs`, as `types.FunctionType` does not seem to support setting them.
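
The behavior can be reproduced with the standard library alone. The sketch below is not the PySpark code: `original`, `copied`, and `call_copied` are hypothetical names, and a small stand-in function with keyword-only defaults is used in place of `namedtuple`.

```python
import types


def original(name, *, verbose=False, rename=False):
    return (name, verbose, rename)


# Copy the function the way the text describes. FunctionType is given the
# positional defaults, but not the keyword-only defaults.
copied = types.FunctionType(
    original.__code__,
    original.__globals__,
    original.__name__,
    original.__defaults__,
    original.__closure__,
)

kw_defaults = original.__kwdefaults__ or {}


def call_copied(*args, **kwargs):
    # Workaround: supply any keyword-only default the caller left out.
    for key, value in kw_defaults.items():
        kwargs.setdefault(key, value)
    return copied(*args, **kwargs)


try:
    copied("Point")
    direct_call_failed = False
except TypeError:
    # The copy lost its keyword-only defaults, so they became required.
    direct_call_failed = True
```

Calling `copied` directly fails with a `TypeError` about missing keyword-only arguments, while `call_copied` fills them in from the original function's `__kwdefaults__`.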
      
      Also, this PR ports the changes in cloudpickle for compatibility for Python 3.6.0.
      
      ## How was this patch tested?
      
      Manually tested with Python 2.7.6 and Python 3.6.0.
      
      ```
      ./bin/pyspark
      ```
      
      , manual creation of `namedtuple` both in local and rdd with Python 3.6.0,
      
      and Jenkins tests for other Python versions.
      
      Also,
      
      ```
      ./run-tests --python-executables=python3.6
      ```
      
      ```
      Will test against the following Python executables: ['python3.6']
      Will test the following Python modules: ['pyspark-core', 'pyspark-ml', 'pyspark-mllib', 'pyspark-sql', 'pyspark-streaming']
      Finished test(python3.6): pyspark.sql.tests (192s)
      Finished test(python3.6): pyspark.accumulators (3s)
      Finished test(python3.6): pyspark.mllib.tests (198s)
      Finished test(python3.6): pyspark.broadcast (3s)
      Finished test(python3.6): pyspark.conf (2s)
      Finished test(python3.6): pyspark.context (14s)
      Finished test(python3.6): pyspark.ml.classification (21s)
      Finished test(python3.6): pyspark.ml.evaluation (11s)
      Finished test(python3.6): pyspark.ml.clustering (20s)
      Finished test(python3.6): pyspark.ml.linalg.__init__ (0s)
      Finished test(python3.6): pyspark.streaming.tests (240s)
      Finished test(python3.6): pyspark.tests (240s)
      Finished test(python3.6): pyspark.ml.recommendation (19s)
      Finished test(python3.6): pyspark.ml.feature (36s)
      Finished test(python3.6): pyspark.ml.regression (37s)
      Finished test(python3.6): pyspark.ml.tuning (28s)
      Finished test(python3.6): pyspark.mllib.classification (26s)
      Finished test(python3.6): pyspark.mllib.evaluation (18s)
      Finished test(python3.6): pyspark.mllib.clustering (44s)
      Finished test(python3.6): pyspark.mllib.linalg.__init__ (0s)
      Finished test(python3.6): pyspark.mllib.feature (26s)
      Finished test(python3.6): pyspark.mllib.fpm (23s)
      Finished test(python3.6): pyspark.mllib.random (8s)
      Finished test(python3.6): pyspark.ml.tests (92s)
      Finished test(python3.6): pyspark.mllib.stat.KernelDensity (0s)
      Finished test(python3.6): pyspark.mllib.linalg.distributed (25s)
      Finished test(python3.6): pyspark.mllib.stat._statistics (15s)
      Finished test(python3.6): pyspark.mllib.recommendation (24s)
      Finished test(python3.6): pyspark.mllib.regression (26s)
      Finished test(python3.6): pyspark.profiler (9s)
      Finished test(python3.6): pyspark.mllib.tree (16s)
      Finished test(python3.6): pyspark.shuffle (1s)
      Finished test(python3.6): pyspark.mllib.util (18s)
      Finished test(python3.6): pyspark.serializers (11s)
      Finished test(python3.6): pyspark.rdd (20s)
      Finished test(python3.6): pyspark.sql.conf (8s)
      Finished test(python3.6): pyspark.sql.catalog (17s)
      Finished test(python3.6): pyspark.sql.column (18s)
      Finished test(python3.6): pyspark.sql.context (18s)
      Finished test(python3.6): pyspark.sql.group (27s)
      Finished test(python3.6): pyspark.sql.dataframe (33s)
      Finished test(python3.6): pyspark.sql.functions (35s)
      Finished test(python3.6): pyspark.sql.types (6s)
      Finished test(python3.6): pyspark.sql.streaming (13s)
      Finished test(python3.6): pyspark.streaming.util (0s)
      Finished test(python3.6): pyspark.sql.session (16s)
      Finished test(python3.6): pyspark.sql.window (4s)
      Finished test(python3.6): pyspark.sql.readwriter (35s)
      Tests passed in 433 seconds
      ```
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #16429 from HyukjinKwon/SPARK-19019.
      
      (cherry picked from commit 20e62806)
      Signed-off-by: Davies Liu <davies.liu@gmail.com>
      2ff36691
  9. Jan 16, 2017
  10. Jan 15, 2017
    • gatorsmile's avatar
      [SPARK-19092][SQL][BACKPORT-2.1] Save() API of DataFrameWriter should not scan... · bf2f233e
      gatorsmile authored
      [SPARK-19092][SQL][BACKPORT-2.1] Save() API of DataFrameWriter should not scan all the saved files #16481
      
      ### What changes were proposed in this pull request?
      
      #### This PR is to backport https://github.com/apache/spark/pull/16481 to Spark 2.1
      ---
      `DataFrameWriter`'s [save() API](https://github.com/gatorsmile/spark/blob/5d38f09f47a767a342a0a8219c63efa2943b5d1f/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala#L207) is performing an unnecessary full filesystem scan for the saved files. The save() API is the most basic/core API in `DataFrameWriter`. We should avoid it.
      
      ### How was this patch tested?
      Added and modified the test cases
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #16588 from gatorsmile/backport-19092.
      bf2f233e
    • gatorsmile's avatar
      [SPARK-19120] Refresh Metadata Cache After Loading Hive Tables · db37049d
      gatorsmile authored
      
      ```Scala
              sql("CREATE TABLE tab (a STRING) STORED AS PARQUET")
      
              // This table fetch is to fill the cache with zero leaf files
              spark.table("tab").show()
      
              sql(
                s"""
                   |LOAD DATA LOCAL INPATH '$newPartitionDir' OVERWRITE
                   |INTO TABLE tab
                 """.stripMargin)
      
              spark.table("tab").show()
      ```
      
      In the above example, the result returned after loading the table is empty. The metadata cache can be out of date after new data is loaded into the table, because loading/inserting does not update the cache. So far, the metadata cache is only used for data source tables. Thus, among Hive serde tables, only the `parquet` and `orc` formats are affected, because Hive serde tables in parquet/orc format can be converted to data source tables when `spark.sql.hive.convertMetastoreParquet`/`spark.sql.hive.convertMetastoreOrc` is on.
      
      This PR is to refresh the metadata cache after processing the `LOAD DATA` command.
      
      In addition, Spark SQL does not convert **partitioned** Hive tables (orc/parquet) to data source tables in the write path, but the read path uses the metadata cache for both **partitioned** and non-partitioned Hive tables (orc/parquet). That means writes to partitioned parquet/orc tables still use `InsertIntoHiveTable` instead of `InsertIntoHadoopFsRelationCommand`. To avoid reading a stale cache, `InsertIntoHiveTable` needs to refresh the metadata cache for partitioned tables. Note that it does not need to refresh the cache for non-partitioned parquet/orc tables, because writes to them do not go through `InsertIntoHiveTable` at all. Based on the review comments, this PR keeps the existing logic unchanged: the table is always refreshed, whether or not it is partitioned.
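The stale-cache behaviour described above can be illustrated with a toy model. This is only a sketch, not Spark code: `CachedTable`, `load_data`, `refresh`, and `read` are hypothetical names, and the "cache" is just a remembered file listing.

```python
class CachedTable:
    """Toy table whose reads go through a cached file listing (hypothetical)."""

    def __init__(self):
        self._files = []           # files actually present in storage
        self._cached_files = None  # cached listing; None means "not cached yet"

    def load_data(self, new_files, refresh_cache):
        # Overwrite the table contents, loosely like LOAD DATA ... OVERWRITE.
        self._files = list(new_files)
        if refresh_cache:
            self.refresh()

    def refresh(self):
        # Drop the cached listing so the next read lists the files again.
        self._cached_files = None

    def read(self):
        # Populate the cache on first use, then keep serving from it.
        if self._cached_files is None:
            self._cached_files = list(self._files)
        return list(self._cached_files)


# Without a refresh, the read after loading still sees the empty listing.
stale = CachedTable()
stale.read()
stale.load_data(["part-00000"], refresh_cache=False)
stale_result = stale.read()

# With a refresh after loading, the new file becomes visible.
fresh = CachedTable()
fresh.read()
fresh.load_data(["part-00000"], refresh_cache=True)
fresh_result = fresh.read()
```

Here `stale_result` is the empty listing cached before the load, while `fresh_result` contains the newly loaded file.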
      
      Added test cases in parquetSuites.scala
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #16500 from gatorsmile/refreshInsertIntoHiveTable.
      
      (cherry picked from commit de62ddf7)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
      db37049d
  11. Jan 13, 2017
    • Yucai Yu's avatar
      [SPARK-19180] [SQL] the offset of short should be 2 in OffHeapColumn · 5e9be1e1
      Yucai Yu authored
      
      ## What changes were proposed in this pull request?
      
      The offset of a short is 4 in `OffHeapColumnVector`'s `putShorts`, but it should be 2.
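Why the width matters can be sketched with a plain byte buffer. This is an illustration only, assuming the offset acts as a per-element byte stride; `put_shorts` and `get_short` are hypothetical helpers and do not mirror the actual `OffHeapColumnVector` code.

```python
import struct

SHORT_WIDTH = 2  # a 16-bit short occupies 2 bytes


def put_shorts(buffer, row_id, values, width=SHORT_WIDTH):
    """Write each value as a little-endian short at (row_id + i) * width."""
    for i, value in enumerate(values):
        struct.pack_into("<h", buffer, (row_id + i) * width, value)


def get_short(buffer, row_id):
    """Read the short stored at the given row, using the correct 2-byte width."""
    return struct.unpack_from("<h", buffer, row_id * SHORT_WIDTH)[0]


values = [10, 20, 30]

# Writer and reader agree on a 2-byte width: values round-trip.
correct = bytearray(16)
put_shorts(correct, 0, values)
read_back = [get_short(correct, i) for i in range(3)]

# Writer uses a 4-byte width while the reader uses 2: rows are misaligned.
wrong = bytearray(16)
put_shorts(wrong, 0, values, width=4)
misread = [get_short(wrong, i) for i in range(3)]
```

With the mismatched width, the second row reads back as zero padding and the third row returns the value meant for the second.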
      
      ## How was this patch tested?
      
      unit test
      
      Author: Yucai Yu <yucai.yu@intel.com>
      
      Closes #16555 from yucai/offheap_short.
      
      (cherry picked from commit ad0dadaa)
      Signed-off-by: Davies Liu <davies.liu@gmail.com>
      5e9be1e1
    • Felix Cheung's avatar
      [SPARK-18335][SPARKR] createDataFrame to support numPartitions parameter · ee3642f5
      Felix Cheung authored
      
      ## What changes were proposed in this pull request?
      
      Allow specifying the number of partitions when the DataFrame is created.
      
      ## How was this patch tested?
      
      manual, unit tests
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #16512 from felixcheung/rnumpart.
      
      (cherry picked from commit b0e8eb6d)
      Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
      ee3642f5
    • Wenchen Fan's avatar
      [SPARK-19178][SQL] convert string of large numbers to int should return null · 2c2ca894
      Wenchen Fan authored
      
      ## What changes were proposed in this pull request?
      
      When we convert a string to an integral type, we first convert the string to `decimal(20, 0)`, so that a string in decimal format is truncated to an integral value, e.g. `CAST('1.2' AS int)` returns `1`.
      
      However, this causes problems when we convert a string holding a large number to an integral type, e.g. `CAST('1234567890123' AS int)` returns `1912276171`, while Hive returns null, as expected.
      
      This is a long-standing bug (it seems to have been there since the first day Spark SQL was created). This PR fixes it by adding native support for converting `UTF8String` to integral types.
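The arithmetic behind `1912276171`, and the intended null-on-overflow behaviour, can be sketched as follows. This is a simplified model under stated assumptions, not the Spark implementation: `wrap_to_int32` and `cast_string_to_int` are hypothetical helpers, and `None` stands in for SQL null.

```python
INT32_MIN = -2**31
INT32_MAX = 2**31 - 1
DIGITS = set("0123456789")


def wrap_to_int32(value):
    """Keep the low 32 bits and interpret them as a signed two's-complement int."""
    low = value & 0xFFFFFFFF
    return low - 2**32 if low >= 2**31 else low


def cast_string_to_int(text):
    """Truncate a decimal string to an int; return None if malformed or out of range."""
    s = text.strip()
    sign = 1
    if s[:1] in ("+", "-"):
        sign = -1 if s[0] == "-" else 1
        s = s[1:]
    integral, _, fraction = s.partition(".")
    # Require at least one integral digit and only ASCII digits on both sides.
    if not integral or not set(integral) <= DIGITS or not set(fraction) <= DIGITS:
        return None
    value = sign * int(integral)
    return value if INT32_MIN <= value <= INT32_MAX else None


wrapped = wrap_to_int32(1234567890123)             # 1912276171
truncated = cast_string_to_int("1.2")              # 1
overflowed = cast_string_to_int("1234567890123")   # None
```

The first helper reproduces the silent 32-bit wrap-around reported above; the second rejects the out-of-range value instead of wrapping it.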
      
      ## How was this patch tested?
      
      new regression tests
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #16550 from cloud-fan/string-to-int.
      
      (cherry picked from commit 6b34e745)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
      2c2ca894
    • Vinayak's avatar
      [SPARK-18687][PYSPARK][SQL] Backward compatibility - creating a Dataframe on a... · b2c9a2c8
      Vinayak authored
      [SPARK-18687][PYSPARK][SQL] Backward compatibility - creating a Dataframe on a new SQLContext object fails with a Derby error
      
      The change makes SQLContext reuse the active SparkSession during construction if the supplied sparkContext is the same as the currently active SparkContext. Without this change, a new SparkSession is instantiated, which results in a Derby error when creating a DataFrame using a new SQLContext object, even though the SparkContext supplied to the new SQLContext is the same as the currently active one. Refer to https://issues.apache.org/jira/browse/SPARK-18687 for details on the error and a repro.
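The reuse rule can be sketched with stand-in classes. This is a rough illustration only: `Session` and `LegacyContext` are hypothetical and do not reproduce the PySpark classes or their constructors.

```python
class Session:
    """Hypothetical stand-in for a session bound to one context."""

    _active = None

    def __init__(self, context):
        self.context = context
        Session._active = self  # the most recently created session is "active"

    @classmethod
    def get_active(cls):
        return cls._active


class LegacyContext:
    """Hypothetical stand-in for a legacy entry point that wraps a session."""

    def __init__(self, context, session=None):
        if session is None:
            active = Session.get_active()
            # Reuse the active session only when it belongs to the same context.
            if active is not None and active.context is context:
                session = active
            else:
                session = Session(context)
        self.session = session


ctx = object()
first = Session(ctx)

same_ctx = LegacyContext(ctx)
reused = same_ctx.session is first          # same context: session is reused

other_ctx = LegacyContext(object())
separate = other_ctx.session is not first   # different context: new session
```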
      
      Existing unit tests and a new unit test added to pyspark-sql:
      
      /python/run-tests --python-executables=python --modules=pyspark-sql
      
      Author: Vinayak <vijoshi5@in.ibm.com>
      Author: Vinayak Joshi <vijoshi@users.noreply.github.com>
      
      Closes #16119 from vijoshi/SPARK-18687_master.
      
      (cherry picked from commit 285a7798)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
      b2c9a2c8
    • Andrew Ash's avatar
      Fix missing close-parens for In filter's toString · 0668e061
      Andrew Ash authored
      
      Otherwise the open parenthesis isn't closed in query plan descriptions of batch scans.
      
          PushedFilters: [In(COL_A, [1,2,4,6,10,16,219,815], IsNotNull(COL_B), ...
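A balanced rendering of such a filter might look like the sketch below. It is only an illustration of the string format: `in_filter_to_string` and `is_balanced` are hypothetical helpers, not the Scala `toString` that the patch changes.

```python
def in_filter_to_string(attribute, values):
    """Render an In filter with both the bracket and the parenthesis closed."""
    joined = ",".join(str(v) for v in values)
    return f"In({attribute}, [{joined}])"


def is_balanced(text):
    """Check that every '(' in the text has a matching ')'."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


rendered = in_filter_to_string("COL_A", [1, 2, 4])   # In(COL_A, [1,2,4])
unclosed = "In(COL_A, [1,2,4]"                        # shape of the old output
```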
      
      Author: Andrew Ash <andrew@andrewash.com>
      
      Closes #16558 from ash211/patch-9.
      
      (cherry picked from commit b040cef2)
      Signed-off-by: Reynold Xin <rxin@databricks.com>
      0668e061
  12. Jan 12, 2017