  1. Aug 10, 2015
    • [SPARK-5155] [PYSPARK] [STREAMING] Mqtt streaming support in Python · 853809e9
      Prabeesh K authored
      This PR is based on #4229, thanks prabeesh.
      
      Closes #4229
      
      Author: Prabeesh K <prabsmails@gmail.com>
      Author: zsxwing <zsxwing@gmail.com>
      Author: prabs <prabsmails@gmail.com>
      Author: Prabeesh K <prabeesh.k@namshi.com>
      
      Closes #7833 from zsxwing/pr4229 and squashes the following commits:
      
      9570bec [zsxwing] Fix the variable name and check null in finally
      4a9c79e [zsxwing] Fix pom.xml indentation
      abf5f18 [zsxwing] Merge branch 'master' into pr4229
      935615c [zsxwing] Fix the flaky MQTT tests
      47278c5 [zsxwing] Include the project class files
      478f844 [zsxwing] Add unpack
      5f8a1d4 [zsxwing] Make the maven build generate the test jar for Python MQTT tests
      734db99 [zsxwing] Merge branch 'master' into pr4229
      126608a [Prabeesh K] address the comments
      b90b709 [Prabeesh K] Merge pull request #1 from zsxwing/pr4229
      d07f454 [zsxwing] Register StreamingListerner before starting StreamingContext; Revert unncessary changes; fix the python unit test
      a6747cb [Prabeesh K] wait for starting the receiver before publishing data
      87fc677 [Prabeesh K] address the comments:
      97244ec [zsxwing] Make sbt build the assembly test jar for streaming mqtt
      80474d1 [Prabeesh K] fix
      1f0cfe9 [Prabeesh K] python style fix
      e1ee016 [Prabeesh K] scala style fix
      a5a8f9f [Prabeesh K] added Python test
      9767d82 [Prabeesh K] implemented Python-friendly class
      a11968b [Prabeesh K] fixed python style
      795ec27 [Prabeesh K] address comments
      ee387ae [Prabeesh K] Fix assembly jar location of mqtt-assembly
      3f4df12 [Prabeesh K] updated version
      b34c3c1 [prabs] adress comments
      3aa7fff [prabs] Added Python streaming mqtt word count example
      b7d42ff [prabs] Mqtt streaming support in Python
  2. Aug 04, 2015
    • [SPARK-8064] [BUILD] Follow-up. Undo change from SPARK-9507 that was accidentally reverted · b211cbc7
      tedyu authored
      This PR removes the dependency reduced POM hack brought back by #7191
      
      Author: tedyu <yuzhihong@gmail.com>
      
      Closes #7919 from tedyu/master and squashes the following commits:
      
      1bfbd7b [tedyu] [BUILD] Remove dependency reduced POM hack
    • [SPARK-9534] [BUILD] Enable javac lint for scalac parity; fix a lot of build warnings, 1.5.0 edition · 76d74090
      Sean Owen authored
      Enable most javac lint warnings; fix a lot of build warnings. In a few cases, touch up surrounding code in the process.
      
      I'll explain several of the changes inline in comments.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #7862 from srowen/SPARK-9534 and squashes the following commits:
      
      ea51618 [Sean Owen] Enable most javac lint warnings; fix a lot of build warnings. In a few cases, touch up surrounding code in the process.
  3. Aug 03, 2015
    • [SPARK-8064] [SQL] Build against Hive 1.2.1 · a2409d1c
      Steve Loughran authored
      Cherry picked the parts of the initial SPARK-8064 WiP branch needed to get sql/hive to compile against hive 1.2.1. That's the ASF release packaged under org.apache.hive, not any fork.
      
      Tests not run yet: that's what the machines are for
      
      Author: Steve Loughran <stevel@hortonworks.com>
      Author: Cheng Lian <lian@databricks.com>
      Author: Michael Armbrust <michael@databricks.com>
      Author: Patrick Wendell <patrick@databricks.com>
      
      Closes #7191 from steveloughran/stevel/feature/SPARK-8064-hive-1.2-002 and squashes the following commits:
      
      7556d85 [Cheng Lian] Updates .q files and corresponding golden files
      ef4af62 [Steve Loughran] Merge commit '6a92bb09f46a04d6cd8c41bdba3ecb727ebb9030' into stevel/feature/SPARK-8064-hive-1.2-002
      6a92bb0 [Cheng Lian] Overrides HiveConf time vars
      dcbb391 [Cheng Lian] Adds com.twitter:parquet-hadoop-bundle:1.6.0 for Hive Parquet SerDe
      0bbe475 [Steve Loughran] SPARK-8064 scalastyle rejects the standard Hadoop ASF license header...
      fdf759b [Steve Loughran] SPARK-8064 classpath dependency suite to be in sync with shading in final (?) hive-exec spark
      7a6c727 [Steve Loughran] SPARK-8064 switch to second staging repo of the spark-hive artifacts. This one has the protobuf-shaded hive-exec jar
      376c003 [Steve Loughran] SPARK-8064 purge duplicate protobuf declaration
      2c74697 [Steve Loughran] SPARK-8064 switch to the protobuf shaded hive-exec jar with tests to chase it down
      cc44020 [Steve Loughran] SPARK-8064 remove hadoop.version from runtest.py, as profile will fix that automatically.
      6901fa9 [Steve Loughran] SPARK-8064 explicit protobuf import
      da310dc [Michael Armbrust] Fixes for Hive tests.
      a775a75 [Steve Loughran] SPARK-8064 cherry-pick-incomplete
      7404f34 [Patrick Wendell] Add spark-hive staging repo
      832c164 [Steve Loughran] SPARK-8064 try to supress compiler warnings on Complex.java pasted-thrift-code
      312c0d4 [Steve Loughran] SPARK-8064  maven/ivy dependency purge; calcite declaration needed
      fa5ae7b [Steve Loughran] HIVE-8064 fix up hive-thriftserver dependencies and cut back on evicted references in the hive- packages; this keeps mvn and ivy resolution compatible, as the reconciliation policy is "by hand"
      c188048 [Steve Loughran] SPARK-8064 manage the Hive depencencies to that -things that aren't needed are excluded -sql/hive built with ivy is in sync with the maven reconciliation policy, rather than latest-first
      4c8be8d [Cheng Lian] WIP: Partial fix for Thrift server and CLI tests
      314eb3c [Steve Loughran] SPARK-8064 deprecation warning  noise in one of the tests
      17b0341 [Steve Loughran] SPARK-8064 IDE-hinted cleanups of Complex.java to reduce compiler warnings. It's all autogenerated code, so still ugly.
      d029b92 [Steve Loughran] SPARK-8064 rely on unescaping to have already taken place, so go straight to map of serde options
      23eca7e [Steve Loughran] HIVE-8064 handle raw and escaped property tokens
      54d9b06 [Steve Loughran] SPARK-8064 fix compilation regression surfacing from rebase
      0b12d5f [Steve Loughran] HIVE-8064 use subset of hive complex type whose types deserialize
      fce73b6 [Steve Loughran] SPARK-8064 poms rely implicitly on the version of kryo chill provides
      fd3aa5d [Steve Loughran] SPARK-8064 version of hive to d/l from ivy is 1.2.1
      dc73ece [Steve Loughran] SPARK-8064 revert to master's determinstic pushdown strategy
      d3c1e4a [Steve Loughran] SPARK-8064 purge UnionType
      051cc21 [Steve Loughran] SPARK-8064 switch to an unshaded version of hive-exec-core, which must have been built with Kryo 2.21. This currently looks for a (locally built) version 1.2.1.spark
      6684c60 [Steve Loughran] SPARK-8064 ignore RTE raised in blocking process.exitValue() call
      e6121e5 [Steve Loughran] SPARK-8064 address review comments
      aa43dc6 [Steve Loughran] SPARK-8064  more robust teardown on JavaMetastoreDatasourcesSuite
      f2bff01 [Steve Loughran] SPARK-8064 better takeup of asynchronously caught error text
      8b1ef38 [Steve Loughran] SPARK-8064: on failures executing spark-submit in HiveSparkSubmitSuite, print command line and all logged output.
      5a9ce6b [Steve Loughran] SPARK-8064 add explicit reason for kv split failure, rather than array OOB. *does not address the issue*
      642b63a [Steve Loughran] SPARK-8064 reinstate something cut briefly during rebasing
      97194dc [Steve Loughran] SPARK-8064 add extra logging to the YarnClusterSuite classpath test. There should be no reason why this is failing on jenkins, but as it is (and presumably its CP-related), improve the logging including any exception raised.
      335357f [Steve Loughran] SPARK-8064 fail fast on thrive process spawning tests on exit codes and/or error string patterns seen in log.
      3ed872f [Steve Loughran] SPARK-8064 rename field double to  dbl
      bca55e5 [Steve Loughran] SPARK-8064 missed one of the `date` escapes
      41d6479 [Steve Loughran] SPARK-8064 wrap tests with withTable() calls to avoid table-exists exceptions
      2bc29a4 [Steve Loughran] SPARK-8064 ParquetSuites to escape `date` field name
      1ab9bc4 [Steve Loughran] SPARK-8064 TestHive to use sered2.thrift.test.Complex
      bf3a249 [Steve Loughran] SPARK-8064: more resubmit than fix; tighten startup timeout to 60s. Still no obvious reason why jersey server code in spark-assembly isn't being picked up -it hasn't been shaded
      c829b8f [Steve Loughran] SPARK-8064: reinstate yarn-rm-server dependencies to hive-exec to ensure that jersey server is on classpath on hadoop versions < 2.6
      0b0f738 [Steve Loughran] SPARK-8064: thrift server startup to fail fast on any exception in the main thread
      13abaf1 [Steve Loughran] SPARK-8064 Hive compatibilty tests sin sync with explain/show output from Hive 1.2.1
      d14d5ea [Steve Loughran] SPARK-8064: DATE is now a predicate; you can't use it as a field in select ops
      26eef1c [Steve Loughran] SPARK-8064: HIVE-9039 renamed TOK_UNION => TOK_UNIONALL while adding TOK_UNIONDISTINCT
      3d64523 [Steve Loughran] SPARK-8064 improve diagns on uknown token; fix scalastyle failure
      d0360f6 [Steve Loughran] SPARK-8064: delicate merge in of the branch vanzin/hive-1.1
      1126e5a [Steve Loughran] SPARK-8064: name of unrecognized file format wasn't appearing in error text
      8cb09c4 [Steve Loughran] SPARK-8064: test resilience/assertion improvements. Independent of the rest of the work; can be backported to earlier versions
      dec12cb [Steve Loughran] SPARK-8064: when a CLI suite test fails include the full output text in the raised exception; this ensures that the stdout/stderr is included in jenkins reports, so it becomes possible to diagnose the cause.
      463a670 [Steve Loughran] SPARK-8064 run-tests.py adds a hadoop-2.6 profile, and changes info messages to say "w/Hive 1.2.1" in console output
      2531099 [Steve Loughran] SPARK-8064 successful attempt to get rid of pentaho as a transitive dependency of hive-exec
      1d59100 [Steve Loughran] SPARK-8064 (unsuccessful) attempt to get rid of pentaho as a transitive dependency of hive-exec
      75733fc [Steve Loughran] SPARK-8064 change thrift binary startup message to "Starting ThriftBinaryCLIService on port"
      3ebc279 [Steve Loughran] SPARK-8064 move strings used to check for http/bin thrift services up into constants
      c80979d [Steve Loughran] SPARK-8064: SparkSQLCLIDriver drops remote mode support. CLISuite Tests pass instead of timing out: undetected regression?
      27e8370 [Steve Loughran] SPARK-8064 fix some style & IDE warnings
      00e50d6 [Steve Loughran] SPARK-8064 stop excluding hive shims from dependency (commented out , for now)
      cb4f142 [Steve Loughran] SPARK-8054 cut pentaho dependency from calcite
      f7aa9cb [Steve Loughran] SPARK-8064 everything compiles with some commenting and moving of classes into a hive package
      6c310b4 [Steve Loughran] SPARK-8064 subclass  Hive ServerOptionsProcessor to make it public again
      f61a675 [Steve Loughran] SPARK-8064 thrift server switched to Hive 1.2.1, though it doesn't compile everywhere
      4890b9d [Steve Loughran] SPARK-8064, build against Hive 1.2.1
  4. Aug 02, 2015
    • [SPARK-9521] [BUILD] Require Maven 3.3.3+ in the build · 9d1c0252
      Sean Owen authored
      Enforce Maven 3.3.3+ in the build. (Also update the scala compiler plugin while we're at it.)
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #7852 from srowen/SPARK-9521 and squashes the following commits:
      
      3093039 [Sean Owen] Enforce Maven 3.3.3+ in the build. (Also update the scala compiler plugin while we're at it.)
  5. Jul 31, 2015
    • [SPARK-9507] [BUILD] Remove dependency reduced POM hack now that shade plugin is updated · 6e5fd613
      Sean Owen authored
      Update to shade plugin 2.4.1, which removes the need for the dependency-reduced-POM workaround and the 'release' profile. Fix management of shade plugin version so children inherit it; bump assembly plugin version while here
      
      See https://issues.apache.org/jira/browse/SPARK-8819
      
      I verified that `mvn clean package -DskipTests` works with Maven 3.3.3.
      
      pwendell are you up for trying this for the 1.5.0 release?
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #7826 from srowen/SPARK-9507 and squashes the following commits:
      
      e0b0fd2 [Sean Owen] Update to shade plugin 2.4.1, which removes the need for the dependency-reduced-POM workaround and the 'release' profile. Fix management of shade plugin version so children inherit it; bump assembly plugin version while here
    • [SPARK-8564] [STREAMING] Add the Python API for Kinesis · 3afc1de8
      zsxwing authored
      This PR adds the Python API for Kinesis, including a Python example and a simple unit test.
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #6955 from zsxwing/kinesis-python and squashes the following commits:
      
      e42e471 [zsxwing] Merge branch 'master' into kinesis-python
      455f7ea [zsxwing] Remove streaming_kinesis_asl_assembly module and simply add the source folder to streaming_kinesis_asl module
      32e6451 [zsxwing] Merge remote-tracking branch 'origin/master' into kinesis-python
      5082d28 [zsxwing] Fix the syntax error for Python 2.6
      fca416b [zsxwing] Fix wrong comparison
      96670ff [zsxwing] Fix the compilation error after merging master
      756a128 [zsxwing] Merge branch 'master' into kinesis-python
      6c37395 [zsxwing] Print stack trace for debug
      7c5cfb0 [zsxwing] RUN_KINESIS_TESTS -> ENABLE_KINESIS_TESTS
      cc9d071 [zsxwing] Fix the python test errors
      466b425 [zsxwing] Add python tests for Kinesis
      e33d505 [zsxwing] Merge remote-tracking branch 'origin/master' into kinesis-python
      3da2601 [zsxwing] Fix the kinesis folder
      687446b [zsxwing] Fix the error message and the maven output path
      add2beb [zsxwing] Merge branch 'master' into kinesis-python
      4957c0b [zsxwing] Add the Python API for Kinesis
  6. Jul 23, 2015
  7. Jul 21, 2015
    • [SPARK-8401] [BUILD] Scala version switching build enhancements · f5b6dc5e
      Michael Allman authored
      These commits address a few minor issues in the Scala cross-version support in the build:
      
        1. Correct two missing `${scala.binary.version}` pom file substitutions.
        2. Don't update `scala.binary.version` in parent POM. This property is set through profiles.
        3. Update the source of the generated scaladocs in `docs/_plugins/copy_api_dirs.rb`.
        4. Factor common code out of `dev/change-version-to-*.sh` and add some validation. We also test `sed` to see if it's GNU sed and try `gsed` as an alternative if not. This prevents the script from running with a non-GNU sed.
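
The GNU sed check described in item 4 can be sketched as follows. This is a minimal Python illustration, not the shell code from the PR: the function names are hypothetical, and it assumes that GNU sed prints a banner containing "GNU sed" for `--version` while other implementations either fail or print something else. Note that the squashed commits listed for this PR indicate the scripts were later restricted to POSIX sed syntax instead.

```python
import subprocess


def is_gnu_sed(version_output):
    """Return True if the `--version` banner identifies GNU sed."""
    return "GNU sed" in version_output


def version_output_of(command):
    """Return the output of `<command> --version`, or "" if it cannot be run."""
    try:
        result = subprocess.run(
            [command, "--version"], capture_output=True, text=True, check=False
        )
    except OSError:
        # The command does not exist or is not executable.
        return ""
    return result.stdout if result.returncode == 0 else ""


def pick_sed(candidates=("sed", "gsed"), probe=version_output_of):
    """Return the first candidate that looks like GNU sed, or None."""
    for command in candidates:
        if is_gnu_sed(probe(command)):
            return command
    return None
```

The `probe` parameter is injectable so the selection logic can be exercised without depending on which sed is installed locally.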
      
      This is my original work and I license this work to the Spark project under the Apache License.
      
      Author: Michael Allman <michael@videoamp.com>
      
      Closes #6832 from mallman/scala-versions and squashes the following commits:
      
      cde2f17 [Michael Allman] Delete dev/change-version-to-*.sh, replacing them with single dev/change-scala-version.sh script that takes a version as argument
      02296f2 [Michael Allman] Make the scala version change scripts cross-platform by restricting ourselves to POSIX sed syntax instead of looking for GNU sed
      ad9b40a [Michael Allman] Factor change-scala-version.sh out of change-version-to-*.sh, adding command line argument validation and testing for GNU sed
      bdd20bf [Michael Allman] Update source of scaladocs when changing Scala version
      475088e [Michael Allman] Replace jackson-module-scala_2.10 with jackson-module-scala_${scala.binary.version}
  8. Jul 19, 2015
  9. Jul 16, 2015
    • [SPARK-9015] [BUILD] Clean project import in scala ide · b536d5dc
      Jan Prach authored
      Cleanup maven for a clean import in scala-ide / eclipse.
      
      * remove groovy plugin which is really not needed at all
      * add-source from build-helper-maven-plugin is not needed as recent versions of scala-maven-plugin do it automatically
      * add lifecycle-mapping plugin to hide a few useless warnings from ide
      
      Author: Jan Prach <jendap@gmail.com>
      
      Closes #7375 from jendap/clean-project-import-in-scala-ide and squashes the following commits:
      
      c4b4c0f [Jan Prach] fix whitespaces
      5a83e07 [Jan Prach] Revert "remove java compiler warnings from java tests"
      312007e [Jan Prach] scala-maven-plugin itself add scala sources by default
      f47d856 [Jan Prach] remove spark-1.4-staging repository
      c8a54db [Jan Prach] remove java compiler warnings from java tests
      999a068 [Jan Prach] remove some maven warnings in scala ide
      80fbdc5 [Jan Prach] remove groovy and gmavenplus plugin
  10. Jul 15, 2015
    • [SPARK-6602][Core]Replace Akka Serialization with Spark Serializer · b9a922e2
      zsxwing authored
      Replace Akka Serialization with Spark Serializer and add unit tests.
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #7159 from zsxwing/remove-akka-serialization and squashes the following commits:
      
      fc0fca3 [zsxwing] Merge branch 'master' into remove-akka-serialization
      cf81a58 [zsxwing] Fix the code style
      73251c6 [zsxwing] Add test scope
      9ef4af9 [zsxwing] Add AkkaRpcEndpointRef.hashCode
      433115c [zsxwing] Remove final
      be3edb0 [zsxwing] Support deserializing RpcEndpointRef
      ecec410 [zsxwing] Replace Akka Serialization with Spark Serializer
  11. Jul 13, 2015
    • [SPARK-8533] [STREAMING] Upgrade Flume to 1.6.0 · 0aed38e4
      Hari Shreedharan authored
      Author: Hari Shreedharan <hshreedharan@apache.org>
      
      Closes #6939 from harishreedharan/upgrade-flume-1.6.0 and squashes the following commits:
      
      94b80ae [Hari Shreedharan] [SPARK-8533][Streaming] Upgrade Flume to 1.6.0
  12. Jul 10, 2015
    • [SPARK-7944] [SPARK-8013] Remove most of the Spark REPL fork for Scala 2.11 · 11e22b74
      Iulian Dragos authored
      This PR removes most of the code in the Spark REPL for Scala 2.11 and leaves just a couple of overridden methods in `SparkILoop` in order to:
      
      - change welcome message
      - restrict available commands (like `:power`)
      - initialize Spark context
      
      The two codebases have diverged and it's extremely hard to backport fixes from the upstream REPL. This somewhat radical step is absolutely necessary in order to fix other REPL tickets (like SPARK-8013 - Hive Thrift server for 2.11). BTW, the Scala REPL has fixed the serialization-unfriendly wrappers thanks to ScrapCodes's work in [#4522](https://github.com/scala/scala/pull/4522)
      
      All tests pass and I tried the `spark-shell` on our Mesos cluster with some simple jobs (including with additional jars), everything looked good.
      
      As soon as Scala 2.11.7 is out we need to upgrade and get a shaded `jline` dependency, clearing the way for SPARK-8013.
      
      /cc pwendell
      
      Author: Iulian Dragos <jaguarul@gmail.com>
      
      Closes #6903 from dragos/issue/no-spark-repl-fork and squashes the following commits:
      
      c596c6f [Iulian Dragos] Merge branch 'master' into issue/no-spark-repl-fork
      2b1a305 [Iulian Dragos] Removed spaces around multiple imports.
      0ce67a6 [Iulian Dragos] Remove -verbose flag for java compiler (added by mistake in an earlier commit).
      10edaf9 [Iulian Dragos] Keep the jline dependency only in the 2.10 build.
      529293b [Iulian Dragos] Add back Spark REPL files to rat-excludes, since they are part of the 2.10 real.
      d85370d [Iulian Dragos] Remove jline dependency from the Spark REPL.
      b541930 [Iulian Dragos] Merge branch 'master' into issue/no-spark-repl-fork
      2b15962 [Iulian Dragos] Change jline dependency and bump Scala version.
      b300183 [Iulian Dragos] Rename package and add license on top of the file, remove files from rat-excludes and removed `-Yrepl-sync` per reviewer’s request.
      9d46d85 [Iulian Dragos] Fix SPARK-7944.
      abcc7cb [Iulian Dragos] Remove the REPL forked code.
  13. Jul 09, 2015
    • [SPARK-8852] [FLUME] Trim dependencies in flume assembly. · 0e78e40c
      Marcelo Vanzin authored
      Also, add support for the *-provided profiles. This avoids repackaging
      things that are already in the Spark assembly, or, in the case of the
      *-provided profiles, are provided by the distribution.
      
      The flume-ng-auth dependency was also excluded since it's not really
      used by Spark.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #7247 from vanzin/SPARK-8852 and squashes the following commits:
      
      298a7d5 [Marcelo Vanzin] Feedback.
      c962082 [Marcelo Vanzin] [SPARK-8852] [flume] Trim dependencies in flume assembly.
    • [SPARK-8959] [SQL] [HOTFIX] Removes parquet-thrift and libthrift dependencies · 2d45571f
      Cheng Lian authored
      These two dependencies were introduced in #7231 to help testing Parquet compatibility with `parquet-thrift`. However, they somehow crash the Scala compiler in Maven builds.
      
      This PR fixes this issue by:
      
      1. Removing these two dependencies, and
      2. Instead of generating the testing Parquet file programmatically, checking in an actual testing Parquet file generated by `parquet-thrift` as a test resource.
      
      This is just a quick fix to bring back Maven builds. We still need to figure out the root cause, as binary Parquet files are harder to maintain.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #7330 from liancheng/spark-8959 and squashes the following commits:
      
      cf69512 [Cheng Lian] Brings back Maven builds
  14. Jul 08, 2015
    • [SPARK-8937] [TEST] A setting `spark.unsafe.exceptionOnMemoryLeak ` is missing in ScalaTest config. · aba5784d
      Kousuke Saruta authored
      `spark.unsafe.exceptionOnMemoryLeak` is present in the config of surefire.
      
      ```
              <!-- Surefire runs all Java tests -->
              <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-surefire-plugin</artifactId>
                <version>2.18.1</version>
                <!-- Note config is repeated in scalatest config -->
      ...
      
      <spark.unsafe.exceptionOnMemoryLeak>true</spark.unsafe.exceptionOnMemoryLeak>
                  </systemProperties>
      ...
      ```
      
      but is absent in the ScalaTest config.
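
One way to catch this kind of drift between the two test-runner configurations is a small consistency check over the POM. The sketch below is illustrative only: the sample POM is a hypothetical, heavily trimmed fragment without the Maven XML namespace (a real `pom.xml` declares one, so the tag lookups would need adjusting), and the `java.awt.headless` entry is just a placeholder property.

```python
import xml.etree.ElementTree as ET

PROPERTY = "spark.unsafe.exceptionOnMemoryLeak"

# Hypothetical, trimmed POM fragment used only for this illustration.
SAMPLE_POM = """
<project>
  <build>
    <plugins>
      <plugin>
        <artifactId>maven-surefire-plugin</artifactId>
        <configuration>
          <systemProperties>
            <spark.unsafe.exceptionOnMemoryLeak>true</spark.unsafe.exceptionOnMemoryLeak>
          </systemProperties>
        </configuration>
      </plugin>
      <plugin>
        <artifactId>scalatest-maven-plugin</artifactId>
        <configuration>
          <systemProperties>
            <java.awt.headless>true</java.awt.headless>
          </systemProperties>
        </configuration>
      </plugin>
    </plugins>
  </build>
</project>
"""


def plugins_missing_property(pom_text, prop=PROPERTY):
    """Return artifactIds of plugins whose <systemProperties> lack `prop`."""
    root = ET.fromstring(pom_text)
    missing = []
    for plugin in root.iter("plugin"):
        props = plugin.find("configuration/systemProperties")
        if props is None:
            continue  # plugin defines no system properties at all
        if not any(child.tag == prop for child in props):
            missing.append(plugin.findtext("artifactId"))
    return missing


print(plugins_missing_property(SAMPLE_POM))
```

On the sample fragment, only the ScalaTest plugin is reported, which mirrors the situation this commit fixes.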
      
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #7308 from sarutak/add-setting-for-memory-leak and squashes the following commits:
      
      95644e7 [Kousuke Saruta] Added a setting for memory leak
    • [SPARK-6123] [SPARK-6775] [SPARK-6776] [SQL] Refactors Parquet read path for interoperability and backwards-compatibility · 4ffc27ca
      Cheng Lian authored
      This PR is a follow-up of #6617 and is part of [SPARK-6774] [2], which aims to ensure interoperability and backwards-compatibility for Spark SQL Parquet support. This one fixes the read path. Spark SQL is now expected to be able to read legacy Parquet data files generated by most (if not all) common libraries/tools like parquet-thrift, parquet-avro, and parquet-hive. However, we still need to refactor the write path to write standard Parquet LISTs and MAPs ([SPARK-8848] [4]).
      
      ### Major changes
      
      1. `CatalystConverter` class hierarchy refactoring
      
         - Replaces `CatalystConverter` trait with a much simpler `ParentContainerUpdater`.
      
           Now instead of extending the original `CatalystConverter` trait, every converter class accepts an updater which is responsible for propagating the converted value to some parent container. For example, appending array elements to a parent array buffer, appending key-value pairs to a parent mutable map, or setting a converted value to some specific field of a parent row. The root converter doesn't have a parent and thus uses a `NoopUpdater`.
      
           This simplifies the design since converters don't need to care about details of their parent converters anymore.
      
         - Unifies `CatalystRootConverter`, `CatalystGroupConverter` and `CatalystPrimitiveRowConverter` into `CatalystRowConverter`
      
           Specifically, now all row objects are represented by `SpecificMutableRow` during conversion.
      
         - Refactors `CatalystArrayConverter`, and removes `CatalystArrayContainsNullConverter` and `CatalystNativeArrayConverter`
      
           `CatalystNativeArrayConverter` was probably designed with the intention of avoiding boxing costs. However, the way it uses Scala generics actually doesn't achieve this goal.
      
           The new `CatalystArrayConverter` handles both nullable and non-nullable array elements in a consistent way.
      
         - Implements backwards-compatibility rules in `CatalystArrayConverter`
      
           When Parquet records are being converted, the schema of the Parquet files should have already been verified. So we only need to care about the structure rather than field names in the Parquet schema. Since all map objects represented in legacy systems have the same structure as the standard one (see [backwards-compatibility rules for MAP] [1]), we only need to deal with LIST (namely array) in `CatalystArrayConverter`.
      
      2. Requested columns handling
      
         When specifying requested columns in `RowReadSupport`, we used to use a Parquet `MessageType` converted from a Catalyst `StructType` that contains all requested columns. This is not preferable when taking compatibility and interoperability into consideration, because the actual Parquet file may have a different physical structure from the converted schema.
      
         In this PR, the schema for requested columns is constructed using the following method:
      
         - For a column that exists in the target Parquet file, we extract the column type by name from the full file schema, and construct a single-field `MessageType` for that column.
         - For a column that doesn't exist in the target Parquet file, we create a single-field `StructType` and convert it to a `MessageType` using `CatalystSchemaConverter`.
         - Finally, we union all single-field `MessageType`s into a full schema containing all requested fields.
      
         With this change, we also fix [SPARK-6123] [3] by validating the global schema against each individual Parquet part-file.
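
The requested-schema construction described above can be sketched in a few lines. This is a simplified Python illustration that uses plain dictionaries as stand-ins for Parquet `MessageType` and Catalyst `StructType`; the type strings, the `catalyst_to_parquet` helper, and the function names are hypothetical and are not Spark or Parquet APIs.

```python
def catalyst_to_parquet(catalyst_type):
    """Hypothetical stand-in for CatalystSchemaConverter (illustration only)."""
    mapping = {
        "IntegerType": "int32",
        "StringType": "binary (UTF8)",
        "DoubleType": "double",
    }
    return mapping[catalyst_type]


def single_field_schema(name, catalyst_type, file_schema):
    """Build a one-column schema for a requested column."""
    if name in file_schema:
        # Column exists in the file: keep its physical type exactly as stored,
        # even if it uses a legacy layout.
        return {name: file_schema[name]}
    # Column is absent from the file: derive its type from the Catalyst type.
    return {name: catalyst_to_parquet(catalyst_type)}


def requested_schema(requested, file_schema):
    """Union all single-field schemas into the full requested schema."""
    full = {}
    for name, catalyst_type in requested:
        full.update(single_field_schema(name, catalyst_type, file_schema))
    return full


# A file written by a legacy tool, with a non-standard list layout.
file_schema = {"id": "int32", "tags": "legacy 2-level list of binary (UTF8)"}
requested = [("tags", "ArrayType(StringType)"), ("score", "DoubleType")]
print(requested_schema(requested, file_schema))
```

In this example `tags` keeps the file's legacy layout because it is looked up by name in the file schema, while `score`, which the file lacks, falls back to the converted type.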
      
      ### Testing
      
      This PR also adds compatibility tests for parquet-avro, parquet-thrift, and parquet-hive. Please refer to `README.md` under `sql/core/src/test` for more information about these tests. To avoid build time code generation and adding extra complexity to the build system, Java code generated from testing Thrift schema and Avro IDL is also checked in.
      
      [1]: https://github.com/apache/incubator-parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules-1
      [2]: https://issues.apache.org/jira/browse/SPARK-6774
      [3]: https://issues.apache.org/jira/browse/SPARK-6123
      [4]: https://issues.apache.org/jira/browse/SPARK-8848
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #7231 from liancheng/spark-6776 and squashes the following commits:
      
      360fe18 [Cheng Lian] Adds ParquetHiveCompatibilitySuite
      c6fbc06 [Cheng Lian] Removes WIP file committed by mistake
      b8c1295 [Cheng Lian] Excludes the whole parquet package from MiMa
      598c3e8 [Cheng Lian] Adds extra Maven repo for hadoop-lzo, which is a transitive dependency of parquet-thrift
      926af87 [Cheng Lian] Simplifies Parquet compatibility test suites
      7946ee1 [Cheng Lian] Fixes Scala styling issues
      3d7ab36 [Cheng Lian] Fixes .rat-excludes
      a8f13bb [Cheng Lian] Using Parquet writer API to do compatibility tests
      f2208cd [Cheng Lian] Adds README.md for Thrift/Avro code generation
      1d390aa [Cheng Lian] Adds parquet-thrift compatibility test
      440f7b3 [Cheng Lian] Adds generated files to .rat-excludes
      13b9121 [Cheng Lian] Adds ParquetAvroCompatibilitySuite
      06cfe9d [Cheng Lian] Adds comments about TimestampType handling
      a099d3e [Cheng Lian] More comments
      0cc1b37 [Cheng Lian] Fixes MiMa checks
      884d3e6 [Cheng Lian] Fixes styling issue and reverts unnecessary changes
      802cbd7 [Cheng Lian] Fixes bugs related to schema merging and empty requested columns
      38fe1e7 [Cheng Lian] Adds explicit return type
      7fb21f1 [Cheng Lian] Reverts an unnecessary debugging change
      1781dff [Cheng Lian] Adds test case for SPARK-8811
      6437d4b [Cheng Lian] Assembles requested schema from Parquet file schema
      bcac49f [Cheng Lian] Removes the 16-byte restriction of decimals
      a74fb2c [Cheng Lian] More comments
      0525346 [Cheng Lian] Removes old Parquet record converters
      03c3bd9 [Cheng Lian] Refactors Parquet read path to implement backwards-compatibility rules
  15. Jul 07, 2015
    • [SPARK-6731] [CORE] Addendum: Upgrade Apache commons-math3 to 3.4.1 · dcbd85b7
      Sean Owen authored
      (This finishes the job by removing the version overridden by Hadoop profiles.)
      
      See discussion at https://github.com/apache/spark/pull/6994#issuecomment-119113167
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #7261 from srowen/SPARK-6731.2 and squashes the following commits:
      
      5a3f59e [Sean Owen] Finish updating Commons Math3 to 3.4.1 from 3.1.1
    • [HOTFIX] Rename release-profile to release · 1cb2629f
      Patrick Wendell authored
      when publishing releases. We named it as 'release-profile' because that is
      the Maven convention. However, it turns out this special name causes several
      other things to kick-in when we are creating releases that are not desirable.
      For instance, it triggers the javadoc plugin to run, which actually fails
      in our current build set-up.
      
      The fix is just to rename this to a different profile to have no
      collateral damage associated with its use.
  16. Jul 06, 2015
    • [SPARK-8819] Fix build for maven 3.3.x · 9eae5fa6
      Andrew Or authored
      This is a workaround for MSHADE-148, which leads to an infinite loop when building Spark with maven 3.3.x. This was originally caused by #6441, which added a bunch of test dependencies on the spark-core test module. Recently, it was revealed by #7193.
      
      This patch adds a `-Prelease` profile. If present, it will set `createDependencyReducedPom` to true. The consequences are:
      - If you are releasing Spark with this profile, you are fine as long as you use maven 3.2.x or before.
      - If you are releasing Spark without this profile, you will run into SPARK-8781.
      - If you are not releasing Spark but you are using this profile, you may run into SPARK-8819.
      - If you are not releasing Spark and you did not include this profile, you are fine.
      
      This is all documented in `pom.xml` and tested locally with both versions of maven.
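
The four cases listed above can be summarized as a small decision function. This is only an illustrative sketch: the function name and return strings are hypothetical, and it assumes that "maven 3.3.x" means any version at or above 3.3.

```python
def release_profile_outcome(releasing, uses_release_profile, maven_version):
    """Summarize the four cases from the commit message (illustrative only).

    maven_version is a tuple such as (3, 2, 5) or (3, 3, 3).
    """
    if uses_release_profile:
        # With -Prelease, createDependencyReducedPom is set to true, which
        # (per the commit message) can run into MSHADE-148 on maven 3.3.x.
        # Treating every version >= 3.3 as affected is an assumption.
        if maven_version >= (3, 3):
            return "may hit SPARK-8819"
        return "fine"
    # Without the profile, only release builds are affected.
    return "hits SPARK-8781" if releasing else "fine"


if __name__ == "__main__":
    for releasing in (True, False):
        for with_profile in (True, False):
            print(releasing, with_profile,
                  release_profile_outcome(releasing, with_profile, (3, 3, 3)))
```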
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #7219 from andrewor14/fix-maven-build and squashes the following commits:
      
      1d37e87 [Andrew Or] Merge branch 'master' of github.com:apache/spark into fix-maven-build
      3574ae4 [Andrew Or] Review comments
      f39199c [Andrew Or] Create a -Prelease profile that flags `createDependencyReducedPom`
  17. Jul 02, 2015
    • [SPARK-8781] Fix variables in published pom.xml are not resolved · 82cf3315
      Andrew Or authored
      The issue is summarized in the JIRA and is caused by this commit: 984ad601.
      
      This patch reverts that commit and fixes the maven build in a different way. We limit the dependencies of `KinesisReceiverSuite` to avoid having to deal with the complexities in how maven deals with transitive test dependencies.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #7193 from andrewor14/fix-kinesis-pom and squashes the following commits:
      
      ca3d5d4 [Andrew Or] Limit kinesis test dependencies
      f24e09c [Andrew Or] Revert "[BUILD] Fix Maven build for Kinesis"
  18. Jul 01, 2015
    • [SPARK-8378] [STREAMING] Add the Python API for Flume · 75b9fe4c
      zsxwing authored
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #6830 from zsxwing/flume-python and squashes the following commits:
      
      78dfdac [zsxwing] Fix the compile error in the test code
      f1bf3c0 [zsxwing] Address TD's comments
      0449723 [zsxwing] Add sbt goal streaming-flume-assembly/assembly
      e93736b [zsxwing] Fix the test case for determine_modules_to_test
      9d5821e [zsxwing] Fix pyspark_core dependencies
      f9ee681 [zsxwing] Merge branch 'master' into flume-python
      7a55837 [zsxwing] Add streaming_flume_assembly to run-tests.py
      b96b0de [zsxwing] Merge branch 'master' into flume-python
      ce85e83 [zsxwing] Fix incompatible issues for Python 3
      01cbb3d [zsxwing] Add import sys
      152364c [zsxwing] Fix the issue that StringIO doesn't work in Python 3
      14ba0ff [zsxwing] Add flume-assembly for sbt building
      b8d5551 [zsxwing] Merge branch 'master' into flume-python
      4762c34 [zsxwing] Fix the doc
      0336579 [zsxwing] Refactor Flume unit tests and also add tests for Python API
      9f33873 [zsxwing] Add the Python API for Flume
  19. Jun 29, 2015
    • [SPARK-8709] Exclude hadoop-client's mockito-all dependency · 27ef8545
      Josh Rosen authored
      This patch excludes `hadoop-client`'s dependency on `mockito-all`.  As of #7061, Spark depends on `mockito-core` instead of `mockito-all`, so the dependency from Hadoop was leading to test compilation failures for some of the Hadoop 2 SBT builds.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #7090 from JoshRosen/SPARK-8709 and squashes the following commits:
      
      e190122 [Josh Rosen] [SPARK-8709] Exclude hadoop-client's mockito-all dependency.
  20. Jun 28, 2015
  21. Jun 22, 2015
    • [SPARK-8307] [SQL] improve timestamp from parquet · 6b7f2cea
      Davies Liu authored
      This PR changes the Parquet timestamp conversion to go from Julian day to Unix timestamp directly (without Calendar and Timestamp).
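
A direct conversion of this kind can be done with integer arithmetic alone. The following is a minimal sketch, not the code from this PR: it assumes the timestamp is stored as a Julian Day Number plus nanoseconds since midnight, that day 2440588 corresponds to 1970-01-01, and that the result is wanted in microseconds. All names are hypothetical.

```python
JULIAN_DAY_OF_EPOCH = 2440588      # Julian Day Number of 1970-01-01 (assumed convention)
SECONDS_PER_DAY = 24 * 60 * 60
MICROS_PER_SECOND = 1_000_000


def julian_day_to_unix_micros(julian_day, nanos_of_day):
    """Convert (Julian day, nanoseconds within the day) to microseconds since the Unix epoch."""
    # Whole days since the epoch, expressed in seconds.
    seconds = (julian_day - JULIAN_DAY_OF_EPOCH) * SECONDS_PER_DAY
    # Sub-microsecond precision is truncated.
    return seconds * MICROS_PER_SECOND + nanos_of_day // 1000


if __name__ == "__main__":
    print(julian_day_to_unix_micros(2440589, 0))
```

No calendar or timestamp objects are created, which is the point of doing the conversion directly.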
      
      cc adrian-wang rxin
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #6759 from davies/improve_ts and squashes the following commits:
      
      849e301 [Davies Liu] Merge branch 'master' of github.com:apache/spark into improve_ts
      b0e4cad [Davies Liu] Merge branch 'master' of github.com:apache/spark into improve_ts
      8e2d56f [Davies Liu] address comments
      634b9f5 [Davies Liu] fix mima
      4891efb [Davies Liu] address comment
      bfc437c [Davies Liu] fix build
      ae5979c [Davies Liu] Merge branch 'master' of github.com:apache/spark into improve_ts
      602b969 [Davies Liu] remove jodd
      2f2e48c [Davies Liu] fix test
      8ace611 [Davies Liu] fix mima
      212143b [Davies Liu] fix mina
      c834108 [Davies Liu] Merge branch 'master' of github.com:apache/spark into improve_ts
      a3171b8 [Davies Liu] Merge branch 'master' of github.com:apache/spark into improve_ts
      5233974 [Davies Liu] fix scala style
      361fd62 [Davies Liu] address comments
      ea196d4 [Davies Liu] improve timestamp from parquet
  22. Jun 11, 2015
    • [SPARK-8289] Specify stack size for consistency with Java tests - resolves test failures · 6b68366d
      Adam Roberts authored
      This simple change specifies a stack size of 4096k for Java tests instead of relying on the vendor default, which varies between Java vendors. This remedies test failures observed in JavaALSSuite with IBM and Oracle Java, whose default stack sizes are lower than OpenJDK's. With 4096k, the tests pass with every Java vendor tested. The alternative is to reduce the number of iterations in the test (no failures were observed with 5 iterations instead of 15).
      
      -Xss works with Oracle's HotSpot VM, IBM's J9 VM and OpenJDK (IcedTea).
      
      I have ensured this does not have any negative implications for other tests.
      
      Author: Adam Roberts <aroberts@uk.ibm.com>
      Author: a-roberts <aroberts@uk.ibm.com>
      
      Closes #6727 from a-roberts/IncJavaStackSize and squashes the following commits:
      
      ab40aea [Adam Roberts] Specify stack size for SBT builds
      5032d8d [a-roberts] Update pom.xml
  23. Jun 09, 2015
  24. Jun 08, 2015
    • [SPARK-8126] [BUILD] Use custom temp directory during build. · a1d9e5cc
      Marcelo Vanzin authored
      Even with all the efforts to clean up the temp directories created by
      unit tests, Spark leaves a lot of garbage in /tmp after a test run.
      This change overrides java.io.tmpdir to place those files under the
      build directory instead.
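
The same idea can be illustrated outside the JVM. The sketch below is a Python analogue, not the mechanism used in this commit: it points Python's `tempfile` module at a `tmp` directory under a build directory, creating it first so that it exists when tests run. The helper name and the `tmp` subdirectory name are hypothetical.

```python
import os
import tempfile


def use_build_tmpdir(build_dir):
    """Redirect temporary files created via `tempfile` to <build_dir>/tmp."""
    tmp = os.path.join(build_dir, "tmp")
    # The directory must exist before anything tries to create files in it.
    os.makedirs(tmp, exist_ok=True)
    # tempfile.tempdir is the module-level default used when no dir= is given.
    tempfile.tempdir = tmp
    return tmp


if __name__ == "__main__":
    base = tempfile.mkdtemp()          # stand-in for a real build directory
    print(use_build_tmpdir(base))
    with tempfile.NamedTemporaryFile() as f:
        print(f.name)                  # now located under <base>/tmp
```

Cleaning up after a test run then amounts to deleting one directory under the build output.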
      
      After an sbt full unit test run, I was left with > 400 MB of temp
      files. Since they're now under the build dir, it's much easier to
      clean them up.
      
      Also make a slight change to a unit test to make it not pollute the
      source directory with test data.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #6674 from vanzin/SPARK-8126 and squashes the following commits:
      
      0f8ad41 [Marcelo Vanzin] Make sure tmp dir exists when tests run.
      643e916 [Marcelo Vanzin] [MINOR] [BUILD] Use custom temp directory during build.
  25. Jun 07, 2015
    • [SPARK-7733] [CORE] [BUILD] Update build, code to use Java 7 for 1.5.0+ · e84815dc
      Sean Owen authored
      Update build to use Java 7, and remove some comments and special-case support for Java 6.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #6265 from srowen/SPARK-7733 and squashes the following commits:
      
      59bda4e [Sean Owen] Update build to use Java 7, and remove some comments and special-case support for Java 6
    • [SPARK-7042] [BUILD] use the standard akka artifacts with hadoop-2.x · ca8dafcc
      Konstantin Shaposhnikov authored
      Both akka 2.3.x and hadoop-2.x use protobuf 2.5, so only the hadoop-1 build
      needs the custom 2.3.4-spark akka version, which shades protobuf-2.5.
      
      This change also updates the akka version (for hadoop-2.x profiles only) to
      the latest 2.3.11, as akka-zeromq_2.11 is not available for akka 2.3.4.
      
      This partially fixes SPARK-7042 (for hadoop-2.x builds)
      
      Author: Konstantin Shaposhnikov <Konstantin.Shaposhnikov@sc.com>
      
      Closes #6492 from kostya-sh/SPARK-7042 and squashes the following commits:
      
      dc195b0 [Konstantin Shaposhnikov] [SPARK-7042] [BUILD] use the standard akka artifacts with hadoop-2.x
  26. Jun 05, 2015
    • Revert "[MINOR] [BUILD] Use custom temp directory during build." · 4036d05c
      Andrew Or authored
      This reverts commit b16b5434.
    • [MINOR] [BUILD] Use custom temp directory during build. · b16b5434
      Marcelo Vanzin authored
      Even with all the efforts to clean up the temp directories created by
      unit tests, Spark leaves a lot of garbage in /tmp after a test run.
      This change overrides java.io.tmpdir to place those files under the
      build directory instead.
      
      After an sbt full unit test run, I was left with > 400 MB of temp
      files. Since they're now under the build dir, it's much easier to
      clean them up.
      
      Also make a slight change to a unit test to make it not pollute the
      source directory with test data.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #6653 from vanzin/unit-test-tmp and squashes the following commits:
      
      31e2dd5 [Marcelo Vanzin] Fix tests that depend on each other.
      aa92944 [Marcelo Vanzin] [minor] [build] Use custom temp directory during build.
  27. Jun 04, 2015
    • [SPARK-8106] [SQL] Set derby.system.durability=test to speed up Hive compatibility tests · 74dc2a90
      Josh Rosen authored
      Derby has a `derby.system.durability` configuration property that can be used to disable I/O synchronization calls for writes. This sacrifices durability but can result in large performance gains, which is appropriate for tests.
      
      We should enable this in our test system properties in order to speed up the Hive compatibility tests. I saw 2-3x speedups locally with this change.
      
      See https://db.apache.org/derby/docs/10.8/ref/rrefproperdurability.html for more documentation of this property.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #6651 from JoshRosen/hive-compat-suite-speedup and squashes the following commits:
      
      b7a08a2 [Josh Rosen] Set derby.system.durability=test in our unit tests.
    • [SPARK-7743] [SQL] Parquet 1.7 · cd3176bd
      Thomas Omans authored
      Resolves [SPARK-7743](https://issues.apache.org/jira/browse/SPARK-7743).
      
      Trivial changes of versions, package names, as well as a small issue in `ParquetTableOperations.scala`
      
      ```diff
      -    val readContext = getReadSupport(configuration).init(
      +    val readContext = ParquetInputFormat.getReadSupportInstance(configuration).init(
      ```
      
      This is needed because `ParquetInputFormat.getReadSupport` was made package private in the latest release.
      
      Thanks
      -- Thomas Omans
      
      Author: Thomas Omans <tomans@cj.com>
      
      Closes #6597 from eggsby/SPARK-7743 and squashes the following commits:
      
      2df0d1b [Thomas Omans] [SPARK-7743] [SQL] Upgrading parquet version to 1.7.0
    • [SPARK-7956] [SQL] Use Janino to compile SQL expressions into bytecode · c8709dcf
      Davies Liu authored
      In order to reduce the overhead of codegen, this PR switches to Janino for compiling SQL expressions into bytecode.
      
      With this change, the time to compile a SQL expression drops from 100ms to 5ms, which is necessary for turning on codegen for general workloads and for tests.
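
The general approach of generating source for an expression and compiling it once can be sketched in a few lines. This is a Python analogue using the built-in `compile` and `eval`, not Janino and not Spark's codegen; the function name and the column-list parameter are hypothetical, and evaluating untrusted expression strings this way would be unsafe.

```python
def compile_expression(source, columns):
    """Compile an expression over named columns into a reusable callable."""
    # Generate the source of a function whose parameters are the column names.
    generated = "lambda {}: {}".format(", ".join(columns), source)
    # Compile to bytecode once; the resulting callable is reused for every row.
    code = compile(generated, "<generated>", "eval")
    return eval(code, {"__builtins__": {}})


if __name__ == "__main__":
    expr = compile_expression("a + b * 2", ["a", "b"])
    rows = [(1, 2), (3, 4)]
    print([expr(a, b) for a, b in rows])
```

Because compilation happens once per expression rather than once per row, its cost matters mainly when many distinct expressions are compiled, as in a test suite.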
      
      cc rxin
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #6479 from davies/janino and squashes the following commits:
      
      cc689f5 [Davies Liu] remove globalLock
      262d848 [Davies Liu] Merge branch 'master' of github.com:apache/spark into janino
      eec3a33 [Davies Liu] address comments from Josh
      f37c8c3 [Davies Liu] fix DecimalType and cast to String
      202298b [Davies Liu] Merge branch 'master' of github.com:apache/spark into janino
      a21e968 [Davies Liu] fix style
      0ed3dc6 [Davies Liu] Merge branch 'master' of github.com:apache/spark into janino
      551a851 [Davies Liu] fix tests
      c3bdffa [Davies Liu] remove print
      6089ce5 [Davies Liu] change logging level
      7e46ac3 [Davies Liu] fix style
      d8f0f6c [Davies Liu] Merge branch 'master' of github.com:apache/spark into janino
      da4926a [Davies Liu] fix tests
      03660f3 [Davies Liu] WIP: use Janino to compile Java source
      f2629cd [Davies Liu] Merge branch 'master' of github.com:apache/spark into janino
      f7d66cf [Davies Liu] use template based string for codegen
  28. Jun 03, 2015
    • [BUILD] Fix Maven build for Kinesis · 984ad601
      Andrew Or authored
      A necessary dependency that is transitively referenced is not
      provided, causing compilation failures in builds that provide
      the kinesis-asl profile.
    • [SPARK-7801] [BUILD] Updating versions to SPARK 1.5.0 · 2c4d550e
      Patrick Wendell authored
      Author: Patrick Wendell <patrick@databricks.com>
      
      Closes #6328 from pwendell/spark-1.5-update and squashes the following commits:
      
      2f42d02 [Patrick Wendell] A few more excludes
      4bebcf0 [Patrick Wendell] Update to RC4
      61aaf46 [Patrick Wendell] Using new release candidate
      55f1610 [Patrick Wendell] Another exclude
      04b4f04 [Patrick Wendell] More issues with transient 1.4 changes
      36f549b [Patrick Wendell] [SPARK-7801] [BUILD] Updating versions to SPARK 1.5.0