  1. Nov 23, 2015
    • [SPARK-9866][SQL] Speed up VersionsSuite by using persistent Ivy cache · 9db5f601
      Josh Rosen authored
      This patch attempts to speed up VersionsSuite by storing fetched Hive JARs in an Ivy cache that persists across test runs. If `SPARK_VERSIONS_SUITE_IVY_PATH` is set, that path will be used for the cache; if it is not set, VersionsSuite will create a temporary Ivy cache which is deleted after the test completes.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #9624 from JoshRosen/SPARK-9866.
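      As a rough sketch of the behavior described above (the helper layout is illustrative, not the actual VersionsSuite code):
      ```scala
      // Use a persistent Ivy cache if SPARK_VERSIONS_SUITE_IVY_PATH is set;
      // otherwise fall back to a temporary directory cleaned up after the run.
      val ivyCachePath: String = sys.env.get("SPARK_VERSIONS_SUITE_IVY_PATH") match {
        case Some(path) => path  // persists across test runs, so Hive JARs are fetched once
        case None =>
          val tmp = java.nio.file.Files.createTempDirectory("hive-ivy-cache")
          tmp.toFile.deleteOnExit()  // deleted after the test completes
          tmp.toString
      }
      ```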
    • [SPARK-11140][CORE] Transfer files using network lib when using NettyRpcEnv. · c2467dad
      Marcelo Vanzin authored
      This change abstracts the code that serves jars / files to executors so that
      each RpcEnv can have its own implementation: the akka version uses the existing
      HTTP-based file serving mechanism, while the netty version uses the new
      stream support added to the network lib. This lets file transfers benefit
      from the easier security configuration of the network library, and should also
      reduce overhead overall.
      
      The change includes a small fix to TransportChannelHandler so that it propagates
      user events to downstream handlers.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #9530 from vanzin/SPARK-11140.
    • [SPARK-11865][NETWORK] Avoid returning inactive client in TransportClientFactory. · 7cfa4c6b
      Marcelo Vanzin authored
      There's a very narrow race here where it would be possible for the timeout handler
      to close a channel after the client factory verified that the channel was still
      active. This change makes sure the client is marked as being recently in use so
      that the timeout handler does not close it until a new timeout cycle elapses.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #9853 from vanzin/SPARK-11865.
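      A minimal sketch of the idea, using hypothetical types rather than the real TransportClientFactory code:
      ```scala
      // A pooled client tracks when it was last handed out; the factory marks it
      // as recently used *before* returning it, so the timeout handler will not
      // close it until a fresh timeout cycle elapses.
      class PooledClient {
        @volatile private var lastUsedNs: Long = System.nanoTime()
        def markRecentlyUsed(): Unit = { lastUsedNs = System.nanoTime() }
        def isIdle(timeoutNs: Long): Boolean = System.nanoTime() - lastUsedNs > timeoutNs
      }

      def checkout(client: PooledClient): PooledClient = {
        client.markRecentlyUsed()  // closes the race with the idle-timeout handler
        client
      }
      ```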
    • [SPARK-11910][STREAMING][DOCS] Update twitter4j dependency version · 242be7da
      Luciano Resende authored
      Author: Luciano Resende <lresende@apache.org>
      
      Closes #9892 from lresende/SPARK-11910.
    • [SPARK-11836][SQL] udf/cast should not create new SQLContext · 1d912020
      Davies Liu authored
      They should use the existing SQLContext.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #9914 from davies/create_udf.
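      Presumably via something like the pattern below (a sketch, not the patch itself; `SQLContext.getOrCreate` is the 1.x API for reusing an active context):
      ```scala
      import org.apache.spark.SparkContext
      import org.apache.spark.sql.SQLContext

      // Reuse the existing context for this SparkContext instead of
      // constructing a fresh SQLContext inside udf/cast handling.
      def existingContext(sc: SparkContext): SQLContext = SQLContext.getOrCreate(sc)
      ```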
    • [SPARK-4424] Remove spark.driver.allowMultipleContexts override in tests · 1b6e938b
      Josh Rosen authored
      This patch removes `spark.driver.allowMultipleContexts=true` from our test configuration. The multiple SparkContexts check was originally disabled because certain tests suites in SQL needed to create multiple contexts. As far as I know, this configuration change is no longer necessary, so we should remove it in order to make it easier to find test cleanup bugs.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #9865 from JoshRosen/SPARK-4424.
    • [SPARK-11837][EC2] python3 compatibility for launching ec2 m3 instances · f6dcc6e9
      Mortada Mehyar authored
      This currently breaks for python3 because the `string` module no longer has `letters`; `string.ascii_letters` should be used instead.
      
      Author: Mortada Mehyar <mortada.mehyar@gmail.com>
      
      Closes #9797 from mortada/python3_fix.
    • [SPARK-11920][ML][DOC] ML LinearRegression should use correct dataset in examples and user guide doc · 98d7ec7d
      Yanbo Liang authored
      ML `LinearRegression` uses `data/mllib/sample_libsvm_data.txt` as the dataset in examples and the user guide doc, but it's actually a classification dataset rather than a regression dataset. We should use `data/mllib/sample_linear_regression_data.txt` instead.
      The deeper cause is that `LinearRegression` with the "normal" solver cannot solve this dataset correctly, possibly due to ill-conditioning and unreasonable labels. This issue has been reported at [SPARK-11918](https://issues.apache.org/jira/browse/SPARK-11918).
      It will confuse users if they run the example code but get an exception, so we should make this change, which clearly illustrates the usage of the `LinearRegression` algorithm.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #9905 from yanboliang/spark-11920.
    • [SPARK-11762][NETWORK] Account for active streams when counting outstanding requests. · 5231cd5a
      Marcelo Vanzin authored
      This way the timeout handling code can correctly close "hung" channels that are
      processing streams.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #9747 from vanzin/SPARK-11762.
    • [SPARK-7173][YARN] Add label expression support for application master · 5fd86e4f
      jerryshao authored
      Add label expression support for the AM to restrict it to run on a specific set of nodes. I tested it locally and it works fine.
      
      sryza and vanzin please help to review, thanks a lot.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #9800 from jerryshao/SPARK-7173.
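      Usage would look roughly like the following; the property name `spark.yarn.am.nodeLabelExpression` is my recollection of what this patch adds, so treat it as an assumption:
      ```scala
      import org.apache.spark.SparkConf

      // Restrict the YARN application master to nodes carrying a given label.
      // Property name assumed by analogy with the executor-side
      // spark.yarn.executor.nodeLabelExpression convention.
      val conf = new SparkConf()
        .set("spark.yarn.am.nodeLabelExpression", "stable-nodes")
      ```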
    • [SPARK-11913][SQL] support typed aggregate with complex buffer schema · 946b4065
      Wenchen Fan authored
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #9898 from cloud-fan/agg.
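      A hedged sketch of what a complex buffer schema means here, written against the 1.6-era `Aggregator` API as I recall it (names are illustrative, not from the patch):
      ```scala
      import org.apache.spark.sql.expressions.Aggregator

      case class AvgBuffer(sum: Double, count: Long)  // a non-primitive buffer schema

      object TypedAvg extends Aggregator[Double, AvgBuffer, Double] {
        def zero: AvgBuffer = AvgBuffer(0.0, 0L)
        def reduce(b: AvgBuffer, a: Double): AvgBuffer = AvgBuffer(b.sum + a, b.count + 1)
        def merge(b1: AvgBuffer, b2: AvgBuffer): AvgBuffer =
          AvgBuffer(b1.sum + b2.sum, b1.count + b2.count)
        def finish(r: AvgBuffer): Double = if (r.count == 0) 0.0 else r.sum / r.count
      }
      ```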
    • [SPARK-11921][SQL] fix `nullable` of encoder schema · f2996e0d
      Wenchen Fan authored
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #9906 from cloud-fan/nullable.
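      A small illustration of the expected behavior, assuming the 1.6-era `Encoders.product` and `Encoder.schema` API:
      ```scala
      import org.apache.spark.sql.Encoders

      case class Person(name: String, age: Int)

      // Primitive fields should be reported as non-nullable, object fields as nullable.
      Encoders.product[Person].schema.foreach { f =>
        println(s"${f.name}: nullable = ${f.nullable}")  // name: true, age: false
      }
      ```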
    • [SPARK-11894][SQL] fix isNull for GetInternalRowField · 1a5baaa6
      Wenchen Fan authored
      We should use `InternalRow.isNullAt` to check whether the field is null before calling `InternalRow.getXXX`.
      
      Thanks gatorsmile who discovered this bug.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #9904 from cloud-fan/null.
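      The general pattern being enforced, as a sketch:
      ```scala
      import org.apache.spark.sql.catalyst.InternalRow

      // Check isNullAt before the typed getter; calling getInt on a null slot
      // silently returns a garbage value instead of signalling null.
      def readIntField(row: InternalRow, ordinal: Int): Option[Int] =
        if (row.isNullAt(ordinal)) None else Some(row.getInt(ordinal))
      ```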
    • [SPARK-11628][SQL] support column datatype of char(x) to recognize HiveChar · 94ce65df
      Xiu Guo authored
      Can someone review my code to make sure I'm not missing anything? Thanks!
      
      Author: Xiu Guo <xguo27@gmail.com>
      Author: Xiu Guo <guoxi@us.ibm.com>
      
      Closes #9612 from xguo27/SPARK-11628.
    • [SPARK-11902][ML] Unhandled case in VectorAssembler#transform · 4be360d4
      BenFradet authored
      There is an unhandled case in the transform method of VectorAssembler if one of the input columns doesn't have one of the supported types: DoubleType, NumericType, BooleanType, or VectorUDT.

      So, if you try to transform a column of StringType you get a cryptic "scala.MatchError: StringType".

      This PR fixes this by throwing a SparkException when dealing with an unknown column type.
      
      Author: BenFradet <benjamin.fradet@gmail.com>
      
      Closes #9885 from BenFradet/SPARK-11902.
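      A sketch of the shape of the fix (illustrative, not the actual VectorAssembler code; `VectorUDT` is omitted from the match because it is private to MLlib):
      ```scala
      import org.apache.spark.SparkException
      import org.apache.spark.sql.types._

      // Give the type match a default case so an unsupported input column raises
      // a descriptive SparkException instead of a cryptic scala.MatchError.
      def checkInputColumnType(dataType: DataType): Unit = dataType match {
        case _: NumericType | BooleanType => ()  // supported scalar types
        case other =>
          throw new SparkException(s"VectorAssembler does not support the $other type")
      }
      ```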
  2. Nov 22, 2015
  3. Nov 21, 2015
  4. Nov 20, 2015
    • Xiangrui Meng
    • [HOTFIX] Fix Java Dataset Tests · 47815878
      Michael Armbrust authored
    • [SPARK-11890][SQL] Fix compilation for Scala 2.11 · 68ed0468
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #9871 from marmbrus/scala211-break.
    • [SPARK-11889][SQL] Fix type inference for GroupedDataset.agg in REPL · 968acf3b
      Michael Armbrust authored
      In this PR I delete a method that breaks type inference for aggregators (only in the REPL).
      
      The error when this method is present is:
      ```
      <console>:38: error: missing parameter type for expanded function ((x$2) => x$2._2)
                    ds.groupBy(_._1).agg(sum(_._2), sum(_._3)).collect()
      ```
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #9870 from marmbrus/dataset-repl-agg.
    • [SPARK-11787][SPARK-11883][SQL][FOLLOW-UP] Cleanup for this patch. · 58b4e4f8
      Nong Li authored
      This mainly moves SqlNewHadoopRDD to the sql package. There is some state that is
      shared with core, and I've left that in core. This allows some other associated
      minor cleanup.
      
      Author: Nong Li <nong@databricks.com>
      
      Closes #9845 from nongli/spark-11787.
    • [SPARK-11549][DOCS] Replace example code in mllib-evaluation-metrics.md using include_example · ed47b1e6
      Vikas Nelamangala authored
      Author: Vikas Nelamangala <vikasnelamangala@Vikass-MacBook-Pro.local>
      
      Closes #9689 from vikasnp/master.
    • [SPARK-11636][SQL] Support classes defined in the REPL with Encoders · 4b84c72d
      Michael Armbrust authored
      #theScaryParts (i.e. changes to the repl, executor classloaders and codegen)...
      
      Author: Michael Armbrust <michael@databricks.com>
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #9825 from marmbrus/dataset-replClasses2.
    • [SPARK-11756][SPARKR] Fix use of aliases - SparkR can not output help information for SparkR:::summary correctly · a6239d58
      felixcheung authored
      Fix use of aliases and change uses of rdname and seealso.
      `aliases` is the hint for `?` - it should not be linked to some other name; those links should be seealso.
      https://cran.r-project.org/web/packages/roxygen2/vignettes/rd.html

      Clean up usage of family, as multiple uses of family with the same rdname cause duplicated See Also html blocks (like http://spark.apache.org/docs/latest/api/R/count.html)
      Also change some rdnames to the dplyr-like variant for better R user visibility in the R doc, e.g. rbind, summary, mutate, summarize
      
      shivaram yanboliang
      
      Author: felixcheung <felixcheung_m@hotmail.com>
      
      Closes #9750 from felixcheung/rdocaliases.
    • [SPARK-11716][SQL] UDFRegistration just drops the input type when re-creating the UserDefinedFunction · 03ba56d7
      Jean-Baptiste Onofré authored
      https://issues.apache.org/jira/browse/SPARK-11716

      This one is #9739 plus a regression test. When committing it, please make sure the author is jbonofre.
      
      You can find the original PR at https://github.com/apache/spark/pull/9739
      
      closes #9739
      
      Author: Jean-Baptiste Onofré <jbonofre@apache.org>
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #9868 from yhuai/SPARK-11716.
    • [SPARK-11887] Close PersistenceEngine at the end of PersistenceEngineSuite tests · 89fd9bd0
      Josh Rosen authored
      In PersistenceEngineSuite, we do not call `close()` on the PersistenceEngine at the end of the test. For the ZooKeeperPersistenceEngine, this causes us to leak a ZooKeeper client, causing the logs of unrelated tests to be periodically spammed with connection error messages from that client:
      
      ```
      15/11/20 05:13:35.789 pool-1-thread-1-ScalaTest-running-PersistenceEngineSuite-SendThread(localhost:15741) INFO ClientCnxn: Opening socket connection to server localhost/127.0.0.1:15741. Will not attempt to authenticate using SASL (unknown error)
      15/11/20 05:13:35.790 pool-1-thread-1-ScalaTest-running-PersistenceEngineSuite-SendThread(localhost:15741) WARN ClientCnxn: Session 0x15124ff48dd0000 for server null, unexpected error, closing socket connection and attempting reconnect
      java.net.ConnectException: Connection refused
      	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
      	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
      	at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350)
      	at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1068)
      ```
      
      This patch fixes this by using a `finally` block.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #9864 from JoshRosen/close-zookeeper-client-in-tests.
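      The shape of the fix, sketched with a self-contained stand-in for the engine type:
      ```scala
      // Always run close(), even when assertions in the test body throw; for
      // ZooKeeperPersistenceEngine this releases the leaked ZooKeeper client.
      trait Closeable { def close(): Unit }

      def withEngine[E <: Closeable, T](engine: E)(testBody: E => T): T =
        try testBody(engine) finally engine.close()
      ```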
    • [SPARK-11870][STREAMING][PYSPARK] Rethrow the exceptions in TransformFunction and TransformFunctionSerializer · be7a2cfd
      Shixiong Zhu authored
      TransformFunction and TransformFunctionSerializer don't rethrow exceptions, so when any exception happens they just return None. This causes weird NPEs and confuses people.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #9847 from zsxwing/pyspark-streaming-exception.
    • [SPARK-11724][SQL] Change casting between int and timestamp to consistently treat int in seconds. · 9ed4ad42
      Nong Li authored
      Hive has since changed this behavior as well. https://issues.apache.org/jira/browse/HIVE-3454
      
      Author: Nong Li <nong@databricks.com>
      Author: Nong Li <nongli@gmail.com>
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #9685 from nongli/spark-11724.
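      A hedged illustration of the consistent semantics, assuming a `sqlContext` in scope (as in spark-shell): the int is treated as seconds since the epoch in both directions, so the round trip is the identity.
      ```scala
      // Cast int -> timestamp -> int; with both directions in seconds this should
      // print 60. Before the change the two directions disagreed about the unit.
      sqlContext.sql("SELECT CAST(CAST(60 AS TIMESTAMP) AS INT)").show()
      ```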
    • [SPARK-11650] Reduce RPC timeouts to speed up slow AkkaUtilsSuite test · 652def31
      Josh Rosen authored
      This patch reduces some RPC timeouts in order to speed up the slow "AkkaUtilsSuite.remote fetch ssl on - untrusted server", which used to take two minutes to run.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #9869 from JoshRosen/SPARK-11650.
    • [SPARK-11819][SQL] nice error message for missing encoder · 3b9d2a34
      Wenchen Fan authored
      Before this PR, when users tried to get an encoder for an unsupported class, they would only get a very simple error message like `Encoder for type xxx is not supported`.

      After this PR, the error message becomes friendlier, for example:
      ```
      No Encoder found for abc.xyz.NonEncodable
      - array element class: "abc.xyz.NonEncodable"
      - field (class: "scala.Array", name: "arrayField")
      - root class: "abc.xyz.AnotherClass"
      ```
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #9810 from cloud-fan/error-message.
    • [SPARK-11817][SQL] Truncating the fractional seconds to prevent inserting a NULL · 60bfb113
      Liang-Chi Hsieh authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-11817
      
      Instead of returning None, we should truncate the fractional seconds to prevent inserting NULL.
      
      Author: Liang-Chi Hsieh <viirya@appier.com>
      
      Closes #9834 from viirya/truncate-fractional-sec.
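      A hedged illustration, assuming a `sqlContext` in scope (as in spark-shell) and Catalyst's microsecond timestamp precision:
      ```scala
      // Fractional seconds beyond microsecond precision should now be truncated
      // rather than turning the whole value into NULL.
      sqlContext.sql("SELECT CAST('2015-11-20 10:00:00.1234567' AS TIMESTAMP)").show()
      // expected: 2015-11-20 10:00:00.123456, not NULL
      ```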
    • [SPARK-11876][SQL] Support printSchema in DataSet API · bef361c5
      gatorsmile authored
      DataSet APIs look great! However, I am lost when doing multi-level joins. For example,
      ```
      val ds1 = Seq(("a", 1), ("b", 2)).toDS().as("a")
      val ds2 = Seq(("a", 1), ("b", 2)).toDS().as("b")
      val ds3 = Seq(("a", 1), ("b", 2)).toDS().as("c")
      
      ds1.joinWith(ds2, $"a._2" === $"b._2").as("ab").joinWith(ds3, $"ab._1._2" === $"c._2").printSchema()
      ```
      
      The printed schema is like
      ```
      root
       |-- _1: struct (nullable = true)
       |    |-- _1: struct (nullable = true)
       |    |    |-- _1: string (nullable = true)
       |    |    |-- _2: integer (nullable = true)
       |    |-- _2: struct (nullable = true)
       |    |    |-- _1: string (nullable = true)
       |    |    |-- _2: integer (nullable = true)
       |-- _2: struct (nullable = true)
       |    |-- _1: string (nullable = true)
       |    |-- _2: integer (nullable = true)
      ```
      
      Personally, I think we need the printSchema function. Sometimes, I do not know how to specify the column, especially when their data types are mixed. For example, if I want to write the following select for the above multi-level join, I have to know the schema:
      ```
      newDS.select(expr("_1._2._2 + 1").as[Int]).collect()
      ```
      
      marmbrus rxin cloud-fan  Do you have the same feeling?
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #9855 from gatorsmile/printSchemaDataSet.