  1. Nov 28, 2015
    • [SPARK-9319][SPARKR] Add support for setting column names, types · c793d2d9
      felixcheung authored
      Add support for colnames, colnames<-, and coltypes<-.
      Also added tests for names and names<-, which previously had no tests.
      
      I merged with PR 8984 (coltypes), clicked the wrong thing, and screwed up the PR. Recreated it here. Was #9218
      
      shivaram sun-rui
      
      Author: felixcheung <felixcheung_m@hotmail.com>
      
      Closes #9654 from felixcheung/colnamescoltypes.
    • [SPARK-12029][SPARKR] Improve column functions signature, param check, tests, fix doc and add examples · 28e46ab4
      felixcheung authored
      
      shivaram sun-rui
      
      Author: felixcheung <felixcheung_m@hotmail.com>
      
      Closes #10019 from felixcheung/rfunctionsdoc.
    • [SPARK-12028] [SQL] get_json_object returns an incorrect result when the value is null literals · 149cd692
      gatorsmile authored
      When calling `get_json_object` for the following two cases, both results are `"null"`:
      
      ```scala
          // assumes: import sqlContext.implicits._ and org.apache.spark.sql.{DataFrame, functions}
          // Case 1: f1 is a JSON null literal.
          val tuple: Seq[(String, String)] = ("5", """{"f1": null}""") :: Nil
          val df: DataFrame = tuple.toDF("key", "jstring")
          val res = df.select(functions.get_json_object($"jstring", "$.f1")).collect()
      ```
      ```scala
          // Case 2: f1 is the string "null", so "null" is the expected result here.
          val tuple2: Seq[(String, String)] = ("5", """{"f1": "null"}""") :: Nil
          val df2: DataFrame = tuple2.toDF("key", "jstring")
          val res3 = df2.select(functions.get_json_object($"jstring", "$.f1")).collect()
      ```
      
      Fixed the problem and also added a test case.
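      A sketch of the expected contrast after the fix (the exact rendering of the collected rows is an assumption here):
      ```scala
      // Expected contrast after the fix (sketch; df and df2 as defined above):
      df.select(functions.get_json_object($"jstring", "$.f1")).head
      // => Row(null)     JSON null literal yields SQL NULL
      df2.select(functions.get_json_object($"jstring", "$.f1")).head
      // => Row("null")   the string value "null" is preserved
      ```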
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #10018 from gatorsmile/get_json_object.
  2. Nov 26, 2015
    • [SPARK-11997] [SQL] NPE when save a DataFrame as parquet and partitioned by long column · a374e20b
      Dilip Biswal authored
      Check the partition column's nullability while building the partition spec.
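      A minimal repro sketch, assuming it is a null value in the long partition column that trips the spec building (schema and output path are illustrative):
      ```scala
      import sqlContext.implicits._
      import org.apache.spark.sql.functions.when

      // Hypothetical repro: "part" is a nullable long column used for partitioning.
      val df = sqlContext.range(0, 2)
        .withColumn("part", when($"id" === 0L, $"id"))  // null when id != 0
      // Before this patch, building the partition spec could hit an NPE on the null.
      df.write.partitionBy("part").parquet("/tmp/spark-11997-demo")
      ```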
      
      Author: Dilip Biswal <dbiswal@us.ibm.com>
      
      Closes #10001 from dilipbiswal/spark-11997.
    • Fix style violation for b63938a8 · 10e315c2
      Reynold Xin authored
    • [SPARK-11991] fixes · 5eaed4e4
      Jeremy Derr authored
      If `--private-ips` is required but not provided, spark_ec2.py may misbehave, for example by attempting to ssh to localhost when verifying ssh connectivity to the cluster.
      
      This fixes that behavior by raising a `UsageError` exception if `get_dns_name` is unable to determine a hostname.
      
      Author: Jeremy Derr <jcderr@radius.com>
      
      Closes #9975 from jcderr/SPARK-11991/ec_spark.py_hostname_check.
    • [SPARK-11778][SQL] add regression test · 4d4cbc03
      Huaxin Gao authored
      Fix regression test for SPARK-11778.
      marmbrus Could you please take a look? Thank you very much!
      
      Author: Huaxin Gao <huaxing@oc0558782468.ibm.com>
      
      Closes #9890 from huaxingao/spark-11778-regression-test.
    • [SPARK-11917][PYSPARK] Add SQLContext#dropTempTable to PySpark · d8220885
      Jeff Zhang authored
      Author: Jeff Zhang <zjffdu@apache.org>
      
      Closes #9903 from zjffdu/SPARK-11917.
    • [SPARK-11881][SQL] Fix for postgresql fetchsize > 0 · b63938a8
      mariusvniekerk authored
      Reference: https://jdbc.postgresql.org/documentation/head/query.html#query-with-cursor
      For PostgreSQL to honor a non-zero fetchSize setting, the Connection's autoCommit needs to be set to false; otherwise it quietly ignores the fetchSize setting.
      
      This adds a new side-effecting, dialect-specific beforeFetch method that fires before a select query is run.
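      For reference, a plain-JDBC sketch of the behavior this hook accounts for (connection string and table name are illustrative):
      ```scala
      import java.sql.DriverManager

      val conn = DriverManager.getConnection("jdbc:postgresql://localhost/testdb")
      conn.setAutoCommit(false)  // required, or PostgreSQL silently ignores fetchSize
      val stmt = conn.createStatement()
      stmt.setFetchSize(50)      // rows now stream through a cursor, 50 at a time
      val rs = stmt.executeQuery("SELECT * FROM big_table")
      while (rs.next()) { /* process one row at a time */ }
      conn.close()
      ```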
      
      Author: mariusvniekerk <marius.v.niekerk@gmail.com>
      
      Closes #9861 from mariusvniekerk/SPARK-11881.
    • [SPARK-12011][SQL] Stddev/Variance etc should support columnName as arguments · 6f6bb0e8
      Yanbo Liang authored
      The following Spark SQL aggregate functions:
      ```
      stddev
      stddev_pop
      stddev_samp
      variance
      var_pop
      var_samp
      skewness
      kurtosis
      collect_list
      collect_set
      ```
      should support `columnName` as arguments, like the other aggregate functions (max/min/count/sum).
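      A usage sketch (`df` and its numeric column `value` are hypothetical):
      ```scala
      import org.apache.spark.sql.functions._

      df.agg(stddev("value"), variance("value"))            // columnName arguments (this change)
      df.agg(stddev(col("value")), variance(col("value")))  // Column arguments, as before
      ```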
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #9994 from yanboliang/SPARK-12011.
    • [SPARK-11996][CORE] Make the executor thread dump work again · 0c1e72e7
      Shixiong Zhu authored
      In the previous implementation, the driver needed to know the executor's listening address in order to send the thread dump request. However, with Netty RPC the executor doesn't listen on any port, so the executor thread dump feature was broken.
      
      This patch fixes that by making the driver use the endpointRef stored in BlockManagerMasterEndpoint to send the thread dump request.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #9976 from zsxwing/executor-thread-dump.
    • doc typo: "classificaion" -> "classification" · 4376b5be
      muxator authored
      Author: muxator <muxator@users.noreply.github.com>
      
      Closes #10008 from muxator/patch-1.
    • [SPARK-11973][SQL] Improve optimizer code readability. · de28e4d4
      Reynold Xin authored
      This is a followup for https://github.com/apache/spark/pull/9959.
      
      I added more documentation and rewrote some monadic code into simpler ifs.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #9995 from rxin/SPARK-11973.
    • [SPARK-11998][SQL][TEST-HADOOP2.0] When downloading Hadoop artifacts from maven, we need to try to download the version that is used by Spark · ad765623
      Yin Huai authored
      
      If we need to download Hive/Hadoop artifacts, try to download a Hadoop that matches the Hadoop used by Spark. If the Hadoop artifact cannot be resolved (e.g. the Hadoop version is a vendor-specific version like 2.0.0-cdh4.1.1), we fall back to Hadoop 2.4.0 (we used to hard-code this as the version downloaded from maven) and we do not share Hadoop classes.
      
      I tested this matching logic on my laptop with the following confs (the same confs used by our builds). All tests passed.
      ```
      build/sbt -Phadoop-1 -Dhadoop.version=1.2.1 -Pkinesis-asl -Phive-thriftserver -Phive
      build/sbt -Phadoop-1 -Dhadoop.version=2.0.0-mr1-cdh4.1.1 -Pkinesis-asl -Phive-thriftserver -Phive
      build/sbt -Pyarn -Phadoop-2.2 -Pkinesis-asl -Phive-thriftserver -Phive
      build/sbt -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -Pkinesis-asl -Phive-thriftserver -Phive
      ```
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #9979 from yhuai/versionsSuite.
    • [SPARK-11863][SQL] Unable to resolve order by if it contains mixture of aliases and real columns · bc16a675
      Dilip Biswal authored
      This is based on https://github.com/apache/spark/pull/9844, with some bug fixes and cleanup.
      
      The problem is that a normal operator should be resolved based on its child, but the `Sort` operator can also be resolved based on its grandchild. So we have three rules that can resolve `Sort`: `ResolveReferences`, `ResolveSortReferences` (if the grandchild is a `Project`) and `ResolveAggregateFunctions` (if the grandchild is an `Aggregate`).
      For example, in `select c1 as a, c2 as b from tab group by c1, c2 order by a, c2`, we need to resolve `a` and `c2` for `Sort`. First, `a` is resolved by `ResolveReferences` based on the child; then, when we reach `ResolveAggregateFunctions`, we try to resolve both `a` and `c2` based on the grandchild, which fails because `a` is not a legal aggregate expression.
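      A sketch reproducing the failing query from the description (table and columns are hypothetical):
      ```scala
      // Before this fix, ORDER BY failed to resolve here because the alias `a`
      // and the real column `c2` had to be handled by different resolution rules.
      sqlContext.sql(
        """SELECT c1 AS a, c2 AS b
          |FROM tab
          |GROUP BY c1, c2
          |ORDER BY a, c2""".stripMargin)
      ```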
      
      Whoever merges this PR, please give the credit to dilipbiswal.
      
      Author: Dilip Biswal <dbiswal@us.ibm.com>
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #9961 from cloud-fan/sort.
    • [SPARK-12005][SQL] Work around VerifyError in HyperLogLogPlusPlus. · 001f0528
      Marcelo Vanzin authored
      Just move the code around a bit; that seems to make the JVM happy.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #9985 from vanzin/SPARK-12005.
    • [SPARK-11973] [SQL] push filter through aggregation with alias and literals · 27d69a05
      Davies Liu authored
      Currently, a filter can't be pushed through an aggregation with aliases or literals; this patch fixes that.
      
      After this patch, TPC-DS query 4 goes from 141 seconds down to 13 seconds (a 10x improvement).
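      An illustrative sketch (`df`, `c1`, `c2` are hypothetical) of the kind of plan this now improves:
      ```scala
      import sqlContext.implicits._
      import org.apache.spark.sql.functions.sum

      val agg = df.groupBy($"c1")
        .agg(sum($"c2").as("total"))
        .select($"c1".as("key"), $"total")
      // The filter references the alias "key"; the optimizer can now rewrite it
      // as a filter on c1 and push it below the aggregation, shrinking its input.
      agg.filter($"key" === 1)
      ```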
      
      cc nongli  yhuai
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #9959 from davies/push_filter2.