  1. Nov 12, 2015
  2. Nov 11, 2015
    • Xiangrui Meng's avatar
      [SPARK-11674][ML] add private val after @transient in Word2VecModel · e2957bc0
      Xiangrui Meng authored
      Without the `private val`, compilation fails with Scala 2.11. See https://issues.scala-lang.org/browse/SI-8813. (Jenkins won't test Scala 2.11; I tested the compile locally.) JoshRosen
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #9644 from mengxr/SPARK-11674.
      e2957bc0
    • Daoyuan Wang's avatar
      [SPARK-11396] [SQL] add native implementation of datetime function to_unix_timestamp · 39b1e36f
      Daoyuan Wang authored
      `to_unix_timestamp` is the deterministic version of `unix_timestamp`, as it requires at least one parameter (with no arguments, `unix_timestamp` returns the current time, which is non-deterministic).
      
      Since the behavior here is quite similar to `unix_timestamp`, I think a DataFrame API is not necessary here.
      
      Author: Daoyuan Wang <daoyuan.wang@intel.com>
      
      Closes #9347 from adrian-wang/to_unix_timestamp.
      39b1e36f
    • Reynold Xin's avatar
      [SPARK-11675][SQL] Remove shuffle hash joins. · e49e7233
      Reynold Xin authored
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #9645 from rxin/SPARK-11675.
      e49e7233
    • Andrew Ray's avatar
      [SPARK-8992][SQL] Add pivot to dataframe api · b8ff6888
      Andrew Ray authored
      This adds a `pivot` method to the DataFrame API.
      
      Following the lead of `cube` and `rollup`, this adds a `Pivot` operator that is translated into an `Aggregate` by the analyzer.
      
      Currently the syntax is like:
      ~~courseSales.pivot(Seq($"year"), $"course", Seq("dotNET", "Java"), sum($"earnings"))~~
      
      ~~Would we be interested in the following syntax also/alternatively? and~~
      
          courseSales.groupBy($"year").pivot($"course", "dotNET", "Java").agg(sum($"earnings"))
          //or
          courseSales.groupBy($"year").pivot($"course").agg(sum($"earnings"))
      
      Later we can add it to `SQLParser`, but as Hive doesn't support it we can't add it there, right?
      
      ~~Also what would be the suggested Java friendly method signature for this?~~
      
      Author: Andrew Ray <ray.andrew@gmail.com>
      
      Closes #7841 from aray/sql-pivot.
      b8ff6888
    • Xiangrui Meng's avatar
      [SPARK-11672][ML] disable spark.ml read/write tests · 1a21be15
      Xiangrui Meng authored
      Saw several failures on Jenkins, e.g., https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2040/testReport/org.apache.spark.ml.util/JavaDefaultReadWriteSuite/testDefaultReadWrite/. This is the first failure in master build:
      
      https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/3982/
      
      I cannot reproduce it locally, so temporarily disable the tests; I will look into the issue under the same JIRA. I'm going to merge the PR after Jenkins passes compile.
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #9641 from mengxr/SPARK-11672.
      1a21be15
    • Reynold Xin's avatar
      [SPARK-10827] replace volatile with Atomic* in AppClient.scala. · e1bcf6af
      Reynold Xin authored
      This is a followup for #9317 to replace volatile fields with AtomicBoolean and AtomicReference.
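      The swap can be sketched in isolation (the field names below are illustrative, not the actual AppClient fields):

```scala
// Illustrative sketch of the @volatile -> Atomic* swap; field names are made up.
import java.util.concurrent.atomic.{AtomicBoolean, AtomicReference}

// Before: @volatile var registered = false
val registered = new AtomicBoolean(false)

// Before: @volatile var masterUrl: String = null
val masterUrl = new AtomicReference[String](null)

registered.set(true)
masterUrl.set("spark://master:7077")

// Unlike a volatile read followed by a write, compareAndSet is one atomic step.
val flipped = registered.compareAndSet(true, false)
```

The atomic classes also buy check-then-act operations that a volatile field alone cannot express safely.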
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #9611 from rxin/SPARK-10827.
      e1bcf6af
    • Josh Rosen's avatar
      [SPARK-11647] Attempt to reduce time/flakiness of Thriftserver CLI and SparkSubmit tests · 2d76e44b
      Josh Rosen authored
      This patch aims to reduce the test time and flakiness of HiveSparkSubmitSuite, SparkSubmitSuite, and CliSuite.
      
      Key changes:
      
      - Disable IO synchronization calls for Derby writes, since durability doesn't matter for tests. This was done for HiveCompatibilitySuite in #6651 and resulted in huge test speedups.
      - Add a few missing `--conf`s to disable various Spark UIs. The CliSuite, in particular, never disabled these UIs, leaving it prone to port-contention-related flakiness.
      - Fix two instances where tests defined `beforeAll()` methods which were never called because the appropriate traits were not mixed in. I updated these test suites to extend `BeforeAndAfterEach` so that they play nicely with our `ResetSystemProperties` trait.
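      A hedged sketch of the kind of flags involved (the class and jar names are placeholders, not taken from the patch): `spark.ui.enabled=false` disables the web UI, and Derby's documented `derby.system.durability=test` property skips its transaction-log syncs.

```shell
# Illustrative only: placeholder app/jar names; flags as described above.
spark-submit \
  --conf spark.ui.enabled=false \
  --driver-java-options "-Dderby.system.durability=test" \
  --class com.example.SomeSuite some-tests.jar
```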
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #9623 from JoshRosen/SPARK-11647.
      2d76e44b
    • Nick Evans's avatar
      [SPARK-11335][STREAMING] update kafka direct python docs on how to get the... · dd77e278
      Nick Evans authored
      [SPARK-11335][STREAMING] update kafka direct python docs on how to get the offset ranges for a KafkaRDD
      
      tdas koeninger
      
      This updates the Spark Streaming + Kafka Integration Guide doc with a working method to access the offsets of a `KafkaRDD` through Python.
      
      Author: Nick Evans <me@nicolasevans.org>
      
      Closes #9289 from manygrams/update_kafka_direct_python_docs.
      dd77e278
    • Reynold Xin's avatar
      [SPARK-11645][SQL] Remove OpenHashSet for the old aggregate. · a9a6b80c
      Reynold Xin authored
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #9621 from rxin/SPARK-11645.
      a9a6b80c
    • Reynold Xin's avatar
      [SPARK-11644][SQL] Remove the option to turn off unsafe and codegen. · df97df2b
      Reynold Xin authored
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #9618 from rxin/SPARK-11644.
      df97df2b
    • Burak Yavuz's avatar
      [SPARK-11639][STREAMING][FLAKY-TEST] Implement BlockingWriteAheadLog for... · 27029bc8
      Burak Yavuz authored
      [SPARK-11639][STREAMING][FLAKY-TEST] Implement BlockingWriteAheadLog for testing the BatchedWriteAheadLog
      
      Several elements could be drained if the main thread is not fast enough. zsxwing warned me about a similar problem, but I missed it here :( Submitting the fix using a waiter.
      
      cc tdas
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #9605 from brkyvz/fix-flaky-test.
      27029bc8
    • Josh Rosen's avatar
      [SPARK-6152] Use shaded ASM5 to support closure cleaning of Java 8 compiled classes · 529a1d33
      Josh Rosen authored
      This patch modifies Spark's closure cleaner (and a few other places) to use ASM 5, which is necessary in order to support cleaning of closures that were compiled by Java 8.
      
      In order to avoid ASM dependency conflicts, Spark excludes ASM from all of its dependencies and uses a shaded version of ASM 4 that comes from `reflectasm` (see [SPARK-782](https://issues.apache.org/jira/browse/SPARK-782) and #232). This patch updates Spark to use a shaded version of ASM 5.0.4 that was published by the Apache XBean project; the POM used to create the shaded artifact can be found at https://github.com/apache/geronimo-xbean/blob/xbean-4.4/xbean-asm5-shaded/pom.xml.
      
      http://movingfulcrum.tumblr.com/post/80826553604/asm-framework-50-the-missing-migration-guide was a useful resource while upgrading the code to use the new ASM5 opcodes.
      
      I also added new regression tests in the `java8-tests` subproject; the existing tests were insufficient to catch this bug, which only affected Scala 2.11 user code compiled targeting Java 8.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #9512 from JoshRosen/SPARK-6152.
      529a1d33
    • Wenchen Fan's avatar
      [SQL][MINOR] remove newLongEncoder in functions · e71ba565
      Wenchen Fan authored
      It may shadow the one from implicits in some cases.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #9629 from cloud-fan/minor.
      e71ba565
    • Wenchen Fan's avatar
      [SPARK-11564][SQL][FOLLOW-UP] clean up java tuple encoder · ec2b8072
      Wenchen Fan authored
      We need to support custom classes like Java beans and combine them into tuples, which is very hard to do with the TypeTag-based approach.
      We should keep only the compose-based way to create tuple encoders.
      
      This PR also moves `Encoder` to `org.apache.spark.sql`.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #9567 from cloud-fan/java.
      ec2b8072
    • Wenchen Fan's avatar
      [SPARK-11656][SQL] support typed aggregate in project list · 9c57bc0e
      Wenchen Fan authored
      Insert `aEncoder` like we do in `agg`.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #9630 from cloud-fan/select.
      9c57bc0e
    • Wenchen Fan's avatar
      [SQL][MINOR] rename present to finish in Aggregator · c964fc10
      Wenchen Fan authored
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #9617 from cloud-fan/tmp.
      c964fc10
    • Reynold Xin's avatar
      [SPARK-11646] WholeTextFileRDD should return Text rather than String · 95daff64
      Reynold Xin authored
      If it returns Text, we can reuse this in Spark SQL to provide a WholeTextFile data source and directly convert the Text into UTF8String without extra string decoding and encoding.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #9622 from rxin/SPARK-11646.
      95daff64
    • Yuming Wang's avatar
      [SPARK-11626][ML] ml.feature.Word2Vec.transform() function very slow · 27524a3a
      Yuming Wang authored
      `org.apache.spark.ml.feature.Word2Vec.transform()` is very slow; we should not read the broadcast variable for every sentence.
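      A toy, pure-Scala illustration of the pattern behind the fix (not the actual Word2Vec code): dereference the shared table once, outside the per-word closure, instead of once per word. In Spark the repeated fetch would be a broadcast variable's `.value`; here it is simulated with a counter.

```scala
// Count how often the shared table is fetched.
var reads = 0
def sharedTable(): Map[String, Int] = { reads += 1; Map("a" -> 1, "b" -> 2) }

val sentences = Seq(Seq("a", "b"), Seq("b"))

// Slow: fetches the table inside the per-word closure (once per word).
val slow = sentences.map(_.flatMap(w => sharedTable().get(w)))
val slowReads = reads

reads = 0
// Fast: hoist the fetch out of the closure and reuse the local reference.
val table = sharedTable()
val fast = sentences.map(_.flatMap(w => table.get(w)))
val fastReads = reads
```

Both variants produce the same result, but the hoisted version touches the shared value once per transform instead of once per word.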
      
      Author: Yuming Wang <q79969786@gmail.com>
      Author: yuming.wang <q79969786@gmail.com>
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #9592 from 979969786/master.
      27524a3a
    • Wenchen Fan's avatar
      [SPARK-10371][SQL][FOLLOW-UP] fix code style · 1510c527
      Wenchen Fan authored
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #9627 from cloud-fan/follow.
      1510c527
    • hyukjinkwon's avatar
      [SPARK-11500][SQL] Not deterministic order of columns when using merging schemas. · 1bc41125
      hyukjinkwon authored
      https://issues.apache.org/jira/browse/SPARK-11500
      
      As filed in SPARK-11500, if merging schemas is enabled, the order in which files are touched matters, as it might affect the ordering of the output columns.
      
      This was mostly because of the use of `Set` and `Map`, so I replaced them with `LinkedHashSet` and `LinkedHashMap` to keep the insertion order.
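      The difference is easy to demonstrate: `LinkedHashSet` iterates in insertion order (and still deduplicates), while the default hash-based `Set` makes no ordering guarantee. The field names below are just an example.

```scala
// LinkedHashSet preserves insertion order, unlike the default hash-based Set.
import scala.collection.mutable.LinkedHashSet

val fields = LinkedHashSet("year", "course", "earnings")
fields += "course"              // duplicate: ignored, order unchanged
val columns = fields.toList     // List("year", "course", "earnings")
```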
      
      Also, I changed `reduceOption` to `reduceLeftOption`, and changed the order of `filesToTouch` from `metadataStatuses ++ commonMetadataStatuses ++ needMerged` to `needMerged ++ metadataStatuses ++ commonMetadataStatuses` in order to touch the part-files first, which always have the schema in their footers, whereas the others might not exist.
      
      One nit: if merging schemas is not enabled but multiple files are given, there is no guarantee of the output order, since there might not be a summary file for the first file, which ends up putting the columns of the other files ahead.
      
      However, I thought this should be okay since disabling merging schemas means (assumes) all the files have the same schema.
      
      In addition, in the test code for this, I only checked the names of fields.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #9517 from HyukjinKwon/SPARK-11500.
      1bc41125
    • Tathagata Das's avatar
      [SPARK-11290][STREAMING] Basic implementation of trackStateByKey · 99f5f988
      Tathagata Das authored
      Current updateStateByKey provides stateful processing in Spark Streaming. It allows the user to maintain per-key state and manage that state using an updateFunction. The updateFunction is called for each key, and it uses new data and existing state of the key, to generate an updated state. However, based on community feedback, we have learnt the following lessons.
      * Need for more optimized state management that does not scan every key
      * Need to make it easier to implement common use cases - (a) timeout of idle data, (b) returning items other than state
      
      The high-level idea of this PR:
      * Introduce a new API, trackStateByKey, that allows the user to update per-key state and emit arbitrary records. The new API is necessary as it will have significantly different semantics from the existing updateStateByKey API. This API will have direct support for timeouts.
      * Internally, the system will keep the state data as a map/list within the partitions of the state RDDs. The new data RDDs will be partitioned appropriately, and for all the key-value data, it will look up the map/list in the state RDD partition and create a new list/map of updated state data. The new state RDD partition will be created based on the updated data and, if necessary, the old data.
      Here is the detailed design doc. Please take a look and provide feedback as comments.
      https://docs.google.com/document/d/1NoALLyd83zGs1hNGMm0Pc5YOVgiPpMHugGMk6COqxxE/edit#heading=h.ph3w0clkd4em
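      A toy sketch of the per-partition idea (hypothetical names and types, not the actual TrackStateRDD code): fold a batch of key-value records into the existing per-key state to produce the next state.

```scala
// Fold a batch of (key, value) records into the previous per-key state;
// a missing key starts from 0. Names and types are illustrative only.
def updateState(
    oldState: Map[String, Int],
    batch: Seq[(String, Int)]): Map[String, Int] =
  batch.foldLeft(oldState) { case (state, (key, value)) =>
    state.updated(key, state.getOrElse(key, 0) + value)
  }

val next = updateState(Map("a" -> 1), Seq("a" -> 2, "b" -> 5))
// next == Map("a" -> 3, "b" -> 5)
```

The real implementation additionally handles timeouts, emitted records, and partition-local storage, which this sketch omits.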
      
      This is still WIP. Major things left to be done.
      - [x] Implement basic functionality of state tracking, with initial RDD and timeouts
      - [x] Unit tests for state tracking
      - [x] Unit tests for initial RDD and timeout
      - [ ] Unit tests for TrackStateRDD
             - [x] state creating, updating, removing
             - [ ] emitting
             - [ ] checkpointing
      - [x] Misc unit tests for State, TrackStateSpec, etc.
      - [x] Update docs and experimental tags
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #9256 from tdas/trackStateByKey.
      99f5f988
    • Davies Liu's avatar
      [SPARK-11463] [PYSPARK] only install signal in main thread · bd70244b
      Davies Liu authored
      Only install the signal handler in the main thread, or it will fail to create a context in a non-main thread.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #9574 from davies/python_signal.
      bd70244b
    • felixcheung's avatar
      [SPARK-11468] [SPARKR] add stddev/variance agg functions for Column · 1a8e0468
      felixcheung authored
      Checked the names; none of them should conflict with anything in base.
      
      shivaram davies rxin
      
      Author: felixcheung <felixcheung_m@hotmail.com>
      
      Closes #9489 from felixcheung/rstddev.
      1a8e0468
    • Josh Rosen's avatar
      [SPARK-10192][HOTFIX] Fix NPE in test that was added in #8402 · fac53d8e
      Josh Rosen authored
      This fixes an NPE introduced in SPARK-10192 / #8402.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #9620 from JoshRosen/SPARK-10192-hotfix.
      fac53d8e
  3. Nov 10, 2015