  1. Aug 10, 2015
    • [SPARK-5155] [PYSPARK] [STREAMING] Mqtt streaming support in Python · 853809e9
      Prabeesh K authored
      This PR is based on #4229, thanks prabeesh.
      
      Closes #4229
      
      Author: Prabeesh K <prabsmails@gmail.com>
      Author: zsxwing <zsxwing@gmail.com>
      Author: prabs <prabsmails@gmail.com>
      Author: Prabeesh K <prabeesh.k@namshi.com>
      
      Closes #7833 from zsxwing/pr4229 and squashes the following commits:
      
      9570bec [zsxwing] Fix the variable name and check null in finally
      4a9c79e [zsxwing] Fix pom.xml indentation
      abf5f18 [zsxwing] Merge branch 'master' into pr4229
      935615c [zsxwing] Fix the flaky MQTT tests
      47278c5 [zsxwing] Include the project class files
      478f844 [zsxwing] Add unpack
      5f8a1d4 [zsxwing] Make the maven build generate the test jar for Python MQTT tests
      734db99 [zsxwing] Merge branch 'master' into pr4229
      126608a [Prabeesh K] address the comments
      b90b709 [Prabeesh K] Merge pull request #1 from zsxwing/pr4229
      d07f454 [zsxwing] Register StreamingListener before starting StreamingContext; Revert unnecessary changes; fix the Python unit test
      a6747cb [Prabeesh K] wait for starting the receiver before publishing data
      87fc677 [Prabeesh K] address the comments:
      97244ec [zsxwing] Make sbt build the assembly test jar for streaming mqtt
      80474d1 [Prabeesh K] fix
      1f0cfe9 [Prabeesh K] python style fix
      e1ee016 [Prabeesh K] scala style fix
      a5a8f9f [Prabeesh K] added Python test
      9767d82 [Prabeesh K] implemented Python-friendly class
      a11968b [Prabeesh K] fixed python style
      795ec27 [Prabeesh K] address comments
      ee387ae [Prabeesh K] Fix assembly jar location of mqtt-assembly
      3f4df12 [Prabeesh K] updated version
      b34c3c1 [prabs] address comments
      3aa7fff [prabs] Added Python streaming mqtt word count example
      b7d42ff [prabs] Mqtt streaming support in Python
    • [SPARK-9763][SQL] Minimize exposure of internal SQL classes. · 40ed2af5
      Reynold Xin authored
      There are a few changes in this pull request:
      
      1. Moved all data sources to execution.datasources, except the public JDBC APIs.
      2. To maintain backward compatibility with change 1, added a backward-compatibility translation map in data source resolution.
      3. Moved ui and metric package into execution.
      4. Added more documentation on some internal classes.
      5. Renamed DataSourceRegister.format -> shortName.
      6. Added the "override" modifier on shortName (a minimal sketch follows this list).
      7. Removed IntSQLMetric.
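
      To make the rename in items 5-6 concrete, here is a minimal hypothetical sketch of a registered source after this change; the class name and alias are illustrative, and the relation-provider plumbing a real source also needs is omitted.

      ```scala
      import org.apache.spark.sql.sources.DataSourceRegister

      // shortName() is the short alias users can pass to DataFrame readers
      // instead of a fully qualified class name.
      class MyFormatSource extends DataSourceRegister {
        override def shortName(): String = "myformat"  // formerly DataSourceRegister.format
      }
      ```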
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #8056 from rxin/SPARK-9763 and squashes the following commits:
      
      9df4801 [Reynold Xin] Removed hardcoded name in test cases.
      d9babc6 [Reynold Xin] Shorten.
      e484419 [Reynold Xin] Removed VisibleForTesting.
      171b812 [Reynold Xin] MimaExcludes.
      2041389 [Reynold Xin] Compile ...
      79dda42 [Reynold Xin] Compile.
      0818ba3 [Reynold Xin] Removed IntSQLMetric.
      c46884f [Reynold Xin] Two more fixes.
      f9aa88d [Reynold Xin] [SPARK-9763][SQL] Minimize exposure of internal SQL classes.
  2. Aug 04, 2015
  3. Aug 03, 2015
    • [SPARK-1855] Local checkpointing · b41a3271
      Andrew Or authored
      Certain use cases of Spark involve RDDs with long lineages that must be truncated periodically (e.g. GraphX). The existing way of doing this is through `rdd.checkpoint()`, which is expensive because it writes to HDFS. This patch provides an alternative that truncates lineages cheaply *without providing the same level of fault tolerance*.
      
      **Local checkpointing** writes checkpointed data to the local file system through the block manager. It is much faster than replicating to reliable storage and provides the same semantics as long as executors do not fail. It is accessible through a new operator, `rdd.localCheckpoint()`, and leaves the old one unchanged. Users may even decide to combine the two and call the reliable one less frequently.
      
      The bulk of this patch involves refactoring the checkpointing interface to accept custom implementations of checkpointing. [Design doc](https://issues.apache.org/jira/secure/attachment/12741708/SPARK-7292-design.pdf).
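
      A minimal sketch of the new operator, assuming an existing `SparkContext` named `sc`; the map function is a stand-in for a long chain of transformations.

      ```scala
      def step(i: Int): Int = i * 2  // stand-in for a long lineage

      val rdd = sc.parallelize(1 to 1000000).map(step)
      rdd.localCheckpoint()  // truncate lineage via the block manager, not HDFS
      rdd.count()            // the first action materializes the checkpoint
      ```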
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #7279 from andrewor14/local-checkpoint and squashes the following commits:
      
      729600f [Andrew Or] Oops, fix tests
      34bc059 [Andrew Or] Avoid computing all partitions in local checkpoint
      e43bbb6 [Andrew Or] Merge branch 'master' of github.com:apache/spark into local-checkpoint
      3be5aea [Andrew Or] Address comments
      bf846a6 [Andrew Or] Merge branch 'master' of github.com:apache/spark into local-checkpoint
      ab003a3 [Andrew Or] Fix compile
      c2e111b [Andrew Or] Address comments
      33f167a [Andrew Or] Merge branch 'master' of github.com:apache/spark into local-checkpoint
      e908a42 [Andrew Or] Fix tests
      f5be0f3 [Andrew Or] Use MEMORY_AND_DISK as the default local checkpoint level
      a92657d [Andrew Or] Update a few comments
      e58e3e3 [Andrew Or] Merge branch 'master' of github.com:apache/spark into local-checkpoint
      4eb6eb1 [Andrew Or] Merge branch 'master' of github.com:apache/spark into local-checkpoint
      1bbe154 [Andrew Or] Simplify LocalCheckpointRDD
      48a9996 [Andrew Or] Avoid traversing dependency tree + rewrite tests
      62aba3f [Andrew Or] Merge branch 'master' of github.com:apache/spark into local-checkpoint
      db70dc2 [Andrew Or] Express local checkpointing through caching the original RDD
      87d43c6 [Andrew Or] Merge branch 'master' of github.com:apache/spark into local-checkpoint
      c449b38 [Andrew Or] Fix style
      4a182f3 [Andrew Or] Add fine-grained tests for local checkpointing
      53b363b [Andrew Or] Rename a few more awkwardly named methods (minor)
      e4cf071 [Andrew Or] Simplify LocalCheckpointRDD + docs + clean ups
      4880deb [Andrew Or] Fix style
      d096c67 [Andrew Or] Fix mima
      172cb66 [Andrew Or] Fix mima?
      e53d964 [Andrew Or] Fix style
      56831c5 [Andrew Or] Add a few warnings and clear exception messages
      2e59646 [Andrew Or] Add local checkpoint clean up tests
      4dbbab1 [Andrew Or] Refactor CheckpointSuite to test local checkpointing
      4514dc9 [Andrew Or] Clean local checkpoint files through RDD cleanups
      0477eec [Andrew Or] Rename a few methods with awkward names (minor)
      2e902e5 [Andrew Or] First implementation of local checkpointing
      8447454 [Andrew Or] Fix tests
      4ac1896 [Andrew Or] Refactor checkpoint interface for modularity
  4. Aug 01, 2015
    • [SPARK-4751] Dynamic allocation in standalone mode · 6688ba6e
      Andrew Or authored
      Dynamic allocation is a feature that allows a Spark application to scale the number of executors up and down dynamically based on the workload. Support was first introduced for YARN in Spark 1.2, and was recently extended to Mesos coarse-grained mode. Today, it is finally supported in standalone mode as well!
      
      I tested this locally and it works as expected. This is WIP because unit tests are coming.
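
      A hedged sketch of turning the feature on against a standalone master; the master URL and executor bounds are placeholders, and the external shuffle service must be running on the workers.

      ```scala
      import org.apache.spark.{SparkConf, SparkContext}

      val conf = new SparkConf()
        .setMaster("spark://master:7077")  // placeholder standalone master URL
        .setAppName("dynamic-allocation-demo")
        .set("spark.dynamicAllocation.enabled", "true")
        .set("spark.shuffle.service.enabled", "true")  // required by dynamic allocation
        .set("spark.dynamicAllocation.minExecutors", "1")
        .set("spark.dynamicAllocation.maxExecutors", "20")
      val sc = new SparkContext(conf)
      ```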
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #7532 from andrewor14/standalone-da and squashes the following commits:
      
      b3c1736 [Andrew Or] Merge branch 'master' of github.com:apache/spark into standalone-da
      879e928 [Andrew Or] Add end-to-end tests for standalone dynamic allocation
      accc8f6 [Andrew Or] Address comments
      ee686a8 [Andrew Or] Merge branch 'master' of github.com:apache/spark into standalone-da
      c0a2c02 [Andrew Or] Fix build after merge conflict
      24149eb [Andrew Or] Merge branch 'master' of github.com:apache/spark into standalone-da
      2e762d6 [Andrew Or] Merge branch 'master' of github.com:apache/spark into standalone-da
      6832bd7 [Andrew Or] Add tests for scheduling with executor limit
      a82e907 [Andrew Or] Fix comments
      0a8be79 [Andrew Or] Simplify logic by removing the worker blacklist
      b7742af [Andrew Or] Merge branch 'master' of github.com:apache/spark into standalone-da
      2eb5f3f [Andrew Or] Merge branch 'master' of github.com:apache/spark into standalone-da
      1334e9a [Andrew Or] Fix MiMa
      32abe44 [Andrew Or] Fix style
      58cb06f [Andrew Or] Privatize worker blacklist for cleanliness
      42ac215 [Andrew Or] Clean up comments and rewrite code for readability
      49702d1 [Andrew Or] Clean up shuffle files after application exits
      80047aa [Andrew Or] First working implementation
  5. Jul 31, 2015
    • [SPARK-8564] [STREAMING] Add the Python API for Kinesis · 3afc1de8
      zsxwing authored
      This PR adds the Python API for Kinesis, including a Python example and a simple unit test.
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #6955 from zsxwing/kinesis-python and squashes the following commits:
      
      e42e471 [zsxwing] Merge branch 'master' into kinesis-python
      455f7ea [zsxwing] Remove streaming_kinesis_asl_assembly module and simply add the source folder to streaming_kinesis_asl module
      32e6451 [zsxwing] Merge remote-tracking branch 'origin/master' into kinesis-python
      5082d28 [zsxwing] Fix the syntax error for Python 2.6
      fca416b [zsxwing] Fix wrong comparison
      96670ff [zsxwing] Fix the compilation error after merging master
      756a128 [zsxwing] Merge branch 'master' into kinesis-python
      6c37395 [zsxwing] Print stack trace for debug
      7c5cfb0 [zsxwing] RUN_KINESIS_TESTS -> ENABLE_KINESIS_TESTS
      cc9d071 [zsxwing] Fix the python test errors
      466b425 [zsxwing] Add python tests for Kinesis
      e33d505 [zsxwing] Merge remote-tracking branch 'origin/master' into kinesis-python
      3da2601 [zsxwing] Fix the kinesis folder
      687446b [zsxwing] Fix the error message and the maven output path
      add2beb [zsxwing] Merge branch 'master' into kinesis-python
      4957c0b [zsxwing] Add the Python API for Kinesis
  6. Jul 24, 2015
    • [build] Enable memory leak detection for Tungsten. · 8fe32b4f
      Reynold Xin authored
      This was turned off accidentally in #7591.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #7637 from rxin/enable-mem-leak-detect and squashes the following commits:
      
      34bc3ef [Reynold Xin] Enable memory leak detection for Tungsten.
  7. Jul 23, 2015
    • Revert "[SPARK-8579] [SQL] support arbitrary object in UnsafeRow" · fb36397b
      Reynold Xin authored
      Reverts ObjectPool. As it stands, it has a few problems:
      
      1. ObjectPool doesn't work with spilling and memory accounting.
      2. I don't think an object pool is what we want to support in the long run, since it essentially goes back to unmanaged memory, creates pressure on the GC, and makes the total in-memory size hard to account for (a sketch of the pool indirection follows below).
      3. The ObjectPool patch removed the specialized getters for strings and binary, and as a result actually introduced branches when reading non-primitive data types.
      
      If we do want to support arbitrary user defined types in the future, I think we can just add an object array in UnsafeRow, rather than relying on indirect memory addressing through a pool. We also need to pick execution strategies that are optimized for those, rather than keeping a lot of unserialized JVM objects in memory during aggregation.
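
      A hypothetical sketch of the pool indirection being reverted, to make point 2 concrete: the fixed-width row slot holds only an index, while the value itself stays behind as an ordinary, GC-visible JVM object.

      ```scala
      import scala.collection.mutable.ArrayBuffer

      class ObjectPool {
        private val objects = ArrayBuffer.empty[AnyRef]

        // Store a value; the returned index is what the fixed-width row slot keeps.
        def put(obj: AnyRef): Int = { objects += obj; objects.length - 1 }

        // Every field read dereferences the pool -- an extra hop, and the pooled
        // objects are invisible to spilling and memory accounting.
        def get(index: Int): AnyRef = objects(index)
      }
      ```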
      
      This is probably the hardest thing I had to revert in Spark, due to recent patches that also change the same part of the code. Would be great to get a careful look.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #7591 from rxin/revert-object-pool and squashes the following commits:
      
      01db0bc [Reynold Xin] Scala style.
      eda89fc [Reynold Xin] Fixed describe.
      2967118 [Reynold Xin] Fixed accessor for JoinedRow.
      e3294eb [Reynold Xin] Merge branch 'master' into revert-object-pool
      657855f [Reynold Xin] Temp commit.
      c20f2c8 [Reynold Xin] Style fix.
      fe37079 [Reynold Xin] Revert "[SPARK-8579] [SQL] support arbitrary object in UnsafeRow"
  8. Jul 22, 2015
    • [SPARK-9262][build] Treat Scala compiler warnings as errors · d71a13f4
      Reynold Xin authored
      I've seen a few cases in the past few weeks in which the compiler threw warnings that were caused by legitimate bugs. This patch upgrades warnings to errors, except deprecation warnings.
      
      Note that ideally we should be able to mark deprecation warnings as errors as well. However, due to the lack of ability to suppress individual warning messages in the Scala compiler, we cannot do that (since we do need to access deprecated APIs in Hadoop).
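
      For reference, the generic sbt mechanism looks like the sketch below; Spark's actual change had to be more surgical, precisely because this flag would also make the unavoidable deprecation warnings fatal.

      ```scala
      // Escalate all compiler warnings to errors (too blunt for Spark's case).
      scalacOptions in Compile += "-Xfatal-warnings"
      ```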
      
      Most of the work was done by ericl.
      
      Author: Reynold Xin <rxin@databricks.com>
      Author: Eric Liang <ekl@databricks.com>
      
      Closes #7598 from rxin/warnings and squashes the following commits:
      
      beb311b [Reynold Xin] Fixed tests.
      542c031 [Reynold Xin] Fixed one more warning.
      87c354a [Reynold Xin] Fixed all non-deprecation warnings.
      78660ac [Eric Liang] first effort to fix warnings
  9. Jul 21, 2015
    • [SPARK-8906][SQL] Move all internal data source classes into execution.datasources. · 60c0ce13
      Reynold Xin authored
      This way, the sources package contains only public facing interfaces.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #7565 from rxin/move-ds and squashes the following commits:
      
      7661aff [Reynold Xin] Mima
      9d5196a [Reynold Xin] Rearranged imports.
      3dd7174 [Reynold Xin] [SPARK-8906][SQL] Move all internal data source classes into execution.datasources.
  10. Jul 20, 2015
    • [SPARK-9114] [SQL] [PySpark] convert returned object from UDF into internal type · 9f913c4f
      Davies Liu authored
      This PR also removes the duplicated code between registerFunction and UserDefinedFunction.
      
      cc JoshRosen
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #7450 from davies/fix_return_type and squashes the following commits:
      
      e80bf9f [Davies Liu] remove debugging code
      f94b1f6 [Davies Liu] fix mima
      8f9c58b [Davies Liu] convert returned object from UDF into internal type
    • [SPARK-7422] [MLLIB] Add argmax to Vector, SparseVector · 3f7de7db
      George Dittmar authored
      Modifying Vector, DenseVector, and SparseVector to implement argmax functionality. This work sets the stage for the changes to be done in SPARK-7423.
      
      Author: George Dittmar <georgedittmar@gmail.com>
      Author: George <dittmar@Georges-MacBook-Pro.local>
      Author: dittmarg <george.dittmar@webtrends.com>
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #6112 from GeorgeDittmar/SPARK-7422 and squashes the following commits:
      
      3e0a939 [George Dittmar] Merge pull request #1 from mengxr/SPARK-7422
      127dec5 [Xiangrui Meng] update argmax impl
      2ea6a55 [George Dittmar] Added MimaExcludes for Vectors.argmax
      98058f4 [George Dittmar] Merge branch 'master' of github.com:apache/spark into SPARK-7422
      5fd9380 [George Dittmar] fixing style check error
      42341fb [George Dittmar] refactoring arg max check to better handle zero values
      b22af46 [George Dittmar] Fixing spaces between commas in unit test
      f2eba2f [George Dittmar] Cleaning up unit tests to be fewer lines
      aa330e3 [George Dittmar] Fixing some last if else spacing issues
      ac53c55 [George Dittmar] changing dense vector argmax unit test to be one line call vs 2
      d5b5423 [George Dittmar] Fixing code style and updating if logic on when to check for zero values
      ee1a85a [George Dittmar] Cleaning up unit tests a bit and modifying a few cases
      3ee8711 [George Dittmar] Fixing corner case issue with zeros in the active values of the sparse vector. Updated unit tests
      b1f059f [George Dittmar] Added comment before we start arg max calculation. Updated unit tests to cover corner cases
      f21dcce [George Dittmar] commit
      af17981 [dittmarg] Initial work fixing bug that was made clear in pr
      eeda560 [George] Fixing SparseVector argmax function to ignore zero values while doing the calculation.
      4526acc [George] Merge branch 'master' of github.com:apache/spark into SPARK-7422
      df9538a [George] Added argmax to sparse vector and added unit test
      3cffed4 [George] Adding unit tests for argmax functions for Dense and Sparse vectors
      04677af [George] initial work on adding argmax to Vector and SparseVector
  11. Jul 18, 2015
    • [SPARK-8278] Remove non-streaming JSON reader. · 45d798c3
      Reynold Xin authored
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #7501 from rxin/jsonrdd and squashes the following commits:
      
      767ec55 [Reynold Xin] More Mima
      51f456e [Reynold Xin] Mima exclude.
      789cb80 [Reynold Xin] Fixed compilation error.
      b4cf50d [Reynold Xin] [SPARK-8278] Remove non-streaming JSON reader.
  12. Jul 17, 2015
    • [SPARK-7879] [MLLIB] KMeans API for spark.ml Pipelines · 34a889db
      Yu ISHIKAWA authored
      I implemented the KMeans API for spark.ml Pipelines, but it doesn't include clustering abstractions for spark.ml (SPARK-7610); that would fit better in another issue, and I'll try it later, since we are adding the hierarchical clustering algorithms in a separate issue. Thanks.
      
      [SPARK-7879] KMeans API for spark.ml Pipelines - ASF JIRA https://issues.apache.org/jira/browse/SPARK-7879
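
      A minimal sketch of the resulting API, assuming a DataFrame `dataset` with a "features" vector column.

      ```scala
      import org.apache.spark.ml.clustering.KMeans

      val kmeans = new KMeans()
        .setK(2)      // number of clusters
        .setSeed(1L)
      val model = kmeans.fit(dataset)           // estimator -> KMeansModel
      val clustered = model.transform(dataset)  // adds a "prediction" column
      ```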
      
      Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
      
      Closes #6756 from yu-iskw/SPARK-7879 and squashes the following commits:
      
      be752de [Yu ISHIKAWA] Add assertions
      a14939b [Yu ISHIKAWA] Fix the dashed line's length in pyspark.ml.rst
      4c61693 [Yu ISHIKAWA] Remove the test about whether "features" and "prediction" columns exist or not in Python
      fb2417c [Yu ISHIKAWA] Use getInt, instead of get
      f397be4 [Yu ISHIKAWA] Switch the comparisons.
      ca78b7d [Yu ISHIKAWA] Add the Scala docs about the constraints of each parameter.
      effc650 [Yu ISHIKAWA] Using expertSetParam and expertGetParam
      c8dc6e6 [Yu ISHIKAWA] Remove an unnecessary test
      19a9d63 [Yu ISHIKAWA] Include spark.ml.clustering to python tests
      1abb19c [Yu ISHIKAWA] Add the statements about spark.ml.clustering into pyspark.ml.rst
      f8338bc [Yu ISHIKAWA] Add the placeholders in Python
      4a03003 [Yu ISHIKAWA] Test for contains in Python
      6566c8b [Yu ISHIKAWA] Use `get`, instead of `apply`
      288e8d5 [Yu ISHIKAWA] Using `contains` to check the column names
      5a7d574 [Yu ISHIKAWA] Rename `validateInitializationMode` to `validateInitMode` and remove throwing exception
      97cfae3 [Yu ISHIKAWA] Fix the type of return value of `KMeans.copy`
      e933723 [Yu ISHIKAWA] Remove the default value of seed from the Model class
      978ee2c [Yu ISHIKAWA] Modify the docs of KMeans, according to mllib's KMeans
      2ec80bc [Yu ISHIKAWA] Fit on 1 line
      e186be1 [Yu ISHIKAWA] Make a few variables, setters and getters be expert ones
      b2c205c [Yu ISHIKAWA] Rename the method `getInitializationSteps` to `getInitSteps` and `setInitializationSteps` to `setInitSteps` in Scala and Python
      f43f5b4 [Yu ISHIKAWA] Rename the method `getInitializationMode` to `getInitMode` and `setInitializationMode` to `setInitMode` in Scala and Python
      3cb5ba4 [Yu ISHIKAWA] Modify the description about epsilon and the validation
      4fa409b [Yu ISHIKAWA] Add a comment about the default value of epsilon
      2f392e1 [Yu ISHIKAWA] Make some variables `final` and Use `IntParam` and `DoubleParam`
      19326f8 [Yu ISHIKAWA] Use `udf`, instead of callUDF
      4d2ad1e [Yu ISHIKAWA] Modify the indentations
      0ae422f [Yu ISHIKAWA] Add a test for `setParams`
      4ff7913 [Yu ISHIKAWA] Add "ml.clustering" to `javacOptions` in SparkBuild.scala
      11ffdf1 [Yu ISHIKAWA] Use `===` and the variable
      220a176 [Yu ISHIKAWA] Set a random seed in the unit testing
      92c3efc [Yu ISHIKAWA] Make the points for a test be fewer
      c758692 [Yu ISHIKAWA] Modify the parameters of KMeans in Python
      6aca147 [Yu ISHIKAWA] Add some unit testings to validate the setter methods
      687cacc [Yu ISHIKAWA] Alias mllib.KMeans as MLlibKMeans in KMeansSuite.scala
      a4dfbef [Yu ISHIKAWA] Modify the last brace and indentations
      5bedc51 [Yu ISHIKAWA] Remove an extra new line
      444c289 [Yu ISHIKAWA] Add the validation for `runs`
      e41989c [Yu ISHIKAWA] Modify how to validate `initStep`
      7ea133a [Yu ISHIKAWA] Change how to validate `initMode`
      7991e15 [Yu ISHIKAWA] Add a validation for `k`
      c2df35d [Yu ISHIKAWA] Make `predict` private
      93aa2ff [Yu ISHIKAWA] Use `withColumn` in `transform`
      d3a79f7 [Yu ISHIKAWA] Remove the inherited docs
      e9532e1 [Yu ISHIKAWA] make `parentModel` of KMeansModel private
      8559772 [Yu ISHIKAWA] Remove the `paramMap` parameter of KMeans
      6684850 [Yu ISHIKAWA] Rename `initializationSteps` to `initSteps`
      99b1b96 [Yu ISHIKAWA] Rename `initializationMode` to `initMode`
      79ea82b [Yu ISHIKAWA] Modify the parameters of KMeans docs
      6569bcd [Yu ISHIKAWA] Change how to set the default values with `setDefault`
      20a795a [Yu ISHIKAWA] Change how to set the default values with `setDefault`
      11c2a12 [Yu ISHIKAWA] Limit the imports
      badb481 [Yu ISHIKAWA] Alias spark.mllib.{KMeans, KMeansModel}
      f80319a [Yu ISHIKAWA] Rebase master branch and add copy methods
      85d92b1 [Yu ISHIKAWA] Add `KMeans.setPredictionCol`
      aa9469d [Yu ISHIKAWA] Fix a python test suite error caused by python 3.x
      c2d6bcb [Yu ISHIKAWA] ADD Java test suites of the KMeans API for spark.ml Pipeline
      598ed2e [Yu ISHIKAWA] Implement the KMeans API for spark.ml Pipelines in Python
      63ad785 [Yu ISHIKAWA] Implement the KMeans API for spark.ml Pipelines in Scala
  13. Jul 13, 2015
    • [SPARK-6797] [SPARKR] Add support for YARN cluster mode. · 7f487c8b
      Sun Rui authored
      This PR enables SparkR to dynamically ship the SparkR binary package to the AM node in YARN cluster mode, thus it is no longer required that the SparkR package be installed on each worker node.
      
      This PR uses the JDK jar tool to package the SparkR package, because jar is expected to be available on both Linux and Windows wherever a JDK is installed.
      
      This PR does not address the R worker involved in RDD API. Will address it in a separate JIRA issue.
      
      This PR does not address SBT build. SparkR installation and packaging by SBT will be addressed in a separate JIRA issue.
      
      R/install-dev.bat is not tested. shivaram, could you help test it?
      
      Author: Sun Rui <rui.sun@intel.com>
      
      Closes #6743 from sun-rui/SPARK-6797 and squashes the following commits:
      
      ca63c86 [Sun Rui] Adjust MimaExcludes after rebase.
      7313374 [Sun Rui] Fix unit test errors.
      72695fb [Sun Rui] Fix unit test failures.
      193882f [Sun Rui] Fix Mima test error.
      fe25a33 [Sun Rui] Fix Mima test error.
      35ecfa3 [Sun Rui] Fix comments.
      c38a005 [Sun Rui] Unzipped SparkR binary package is still required for standalone and Mesos modes.
      b05340c [Sun Rui] Fix scala style.
      2ca5048 [Sun Rui] Fix comments.
      1acefd1 [Sun Rui] Fix scala style.
      0aa1e97 [Sun Rui] Fix scala style.
      41d4f17 [Sun Rui] Add support for locating SparkR package for R workers required by RDD APIs.
      49ff948 [Sun Rui] Invoke jar.exe with full path in install-dev.bat.
      7b916c5 [Sun Rui] Use 'rem' consistently.
      3bed438 [Sun Rui] Add a comment.
      681afb0 [Sun Rui] Fix a bug that RRunner does not handle client deployment modes.
      cedfbe2 [Sun Rui] [SPARK-6797][SPARKR] Add support for YARN cluster mode.
  14. Jul 10, 2015
    • [SPARK-7977] [BUILD] Disallowing println · e14b545d
      Jonathan Alter authored
      Author: Jonathan Alter <jonalter@users.noreply.github.com>
      
      Closes #7093 from jonalter/SPARK-7977 and squashes the following commits:
      
      ccd44cc [Jonathan Alter] Changed println to log in ThreadingSuite
      7fcac3e [Jonathan Alter] Reverting to println in ThreadingSuite
      10724b6 [Jonathan Alter] Changing some printlns to logs in tests
      eeec1e7 [Jonathan Alter] Merge branch 'master' of github.com:apache/spark into SPARK-7977
      0b1dcb4 [Jonathan Alter] More println cleanup
      aedaf80 [Jonathan Alter] Merge branch 'master' of github.com:apache/spark into SPARK-7977
      925fd98 [Jonathan Alter] Merge branch 'master' of github.com:apache/spark into SPARK-7977
      0c16fa3 [Jonathan Alter] Replacing some printlns with logs
      45c7e05 [Jonathan Alter] Merge branch 'master' of github.com:apache/spark into SPARK-7977
      5c8e283 [Jonathan Alter] Allowing println in audit-release examples
      5b50da1 [Jonathan Alter] Allowing printlns in example files
      ca4b477 [Jonathan Alter] Merge branch 'master' of github.com:apache/spark into SPARK-7977
      83ab635 [Jonathan Alter] Fixing new printlns
      54b131f [Jonathan Alter] Merge branch 'master' of github.com:apache/spark into SPARK-7977
      1cd8a81 [Jonathan Alter] Removing some unnecessary comments and printlns
      b837c3a [Jonathan Alter] Disallowing println
  15. Jul 09, 2015
    • [SPARK-8701] [STREAMING] [WEBUI] Add input metadata in the batch page · 1f6b0b12
      zsxwing authored
      This PR adds `metadata` to `InputInfo`. `InputDStream` can report its metadata for a batch and it will be shown in the batch page.
      
      For example,
      
      ![screen shot](https://cloud.githubusercontent.com/assets/1000778/8403741/d6ffc7e2-1e79-11e5-9888-c78c1575123a.png)
      
      FileInputDStream will display the new files for a batch, and DirectKafkaInputDStream will display its offset ranges.
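
      A sketch of the reporting type, following the rename noted in the commit list below (`InputInfo` -> `StreamInputInfo`, with `metadata` as `Map[String, Any]`); the exact field layout is an assumption.

      ```scala
      // Each input stream reports per-batch counts plus free-form metadata,
      // which the batch page renders (e.g. new files, Kafka offset ranges).
      case class StreamInputInfo(
          inputStreamId: Int,
          numRecords: Long,
          metadata: Map[String, Any] = Map.empty)
      ```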
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #7081 from zsxwing/input-metadata and squashes the following commits:
      
      f7abd9b [zsxwing] Revert the space changes in project/MimaExcludes.scala
      d906209 [zsxwing] Merge branch 'master' into input-metadata
      74762da [zsxwing] Fix MiMa tests
      7903e33 [zsxwing] Merge branch 'master' into input-metadata
      450a46c [zsxwing] Address comments
      1d94582 [zsxwing] Rename InputInfo to StreamInputInfo and change "metadata" to Map[String, Any]
      d496ae9 [zsxwing] Add input metadata in the batch page
  16. Jul 08, 2015
    • [SPARK-8450] [SQL] [PYSPARK] cleanup type converter for Python DataFrame · 74d8d3d9
      Davies Liu authored
      This PR fixes the converter for Python DataFrames, especially for DecimalType.
      
      Closes #7106
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #7131 from davies/decimal_python and squashes the following commits:
      
      4d3c234 [Davies Liu] Merge branch 'master' of github.com:apache/spark into decimal_python
      20531d6 [Davies Liu] Merge branch 'master' of github.com:apache/spark into decimal_python
      7d73168 [Davies Liu] fix conflict
      6cdd86a [Davies Liu] Merge branch 'master' of github.com:apache/spark into decimal_python
      7104e97 [Davies Liu] improve type infer
      9cd5a21 [Davies Liu] run python tests with SPARK_PREPEND_CLASSES
      829a05b [Davies Liu] fix UDT in python
      c99e8c5 [Davies Liu] fix mima
      c46814a [Davies Liu] convert decimal for Python DataFrames
    • [SPARK-8914][SQL] Remove RDDApi · 2a4f88b6
      Kousuke Saruta authored
      As rxin suggested in #7298, we should consider removing `RDDApi`.
      
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #7302 from sarutak/remove-rddapi and squashes the following commits:
      
      e495d35 [Kousuke Saruta] Fixed mima
      cb7ebb9 [Kousuke Saruta] Removed overriding RDDApi
    • [SPARK-6123] [SPARK-6775] [SPARK-6776] [SQL] Refactors Parquet read path for interoperability and backwards-compatibility · 4ffc27ca
      Cheng Lian authored
      
      This PR is a follow-up of #6617 and is part of [SPARK-6774] [2], which aims to ensure interoperability and backwards-compatibility for Spark SQL Parquet support.  And this one fixes the read path.  Now Spark SQL is expected to be able to read legacy Parquet data files generated by most (if not all) common libraries/tools like parquet-thrift, parquet-avro, and parquet-hive. However, we still need to refactor the write path to write standard Parquet LISTs and MAPs ([SPARK-8848] [4]).
      
      ### Major changes
      
      1. `CatalystConverter` class hierarchy refactoring
      
         - Replaces `CatalystConverter` trait with a much simpler `ParentContainerUpdater`.
      
           Now instead of extending the original `CatalystConverter` trait, every converter class accepts an updater which is responsible for propagating the converted value to some parent container. For example, appending array elements to a parent array buffer, appending a key-value pairs to a parent mutable map, or setting a converted value to some specific field of a parent row. Root converter doesn't have a parent and thus uses a `NoopUpdater`.
      
           This simplifies the design, since converters no longer need to care about the details of their parent converters (a sketch of this updater pattern follows the list).
      
         - Unifies `CatalystRootConverter`, `CatalystGroupConverter` and `CatalystPrimitiveRowConverter` into `CatalystRowConverter`
      
           Specifically, now all row objects are represented by `SpecificMutableRow` during conversion.
      
         - Refactors `CatalystArrayConverter`, and removes `CatalystArrayContainsNullConverter` and `CatalystNativeArrayConverter`
      
           `CatalystNativeArrayConverter` was probably designed with the intention of avoiding boxing costs. However, the way it uses Scala generics actually doesn't achieve this goal.
      
           The new `CatalystArrayConverter` handles both nullable and non-nullable array elements in a consistent way.
      
         - Implements backwards-compatibility rules in `CatalystArrayConverter`
      
           When Parquet records are being converted, schema of Parquet files should have already been verified. So we only need to care about the structure rather than field names in the Parquet schema. Since all map objects represented in legacy systems have the same structure as the standard one (see [backwards-compatibility rules for MAP] [1]), we only need to deal with LIST (namely array) in `CatalystArrayConverter`.
      
      2. Requested columns handling
      
         When specifying requested columns in `RowReadSupport`, we used to use a Parquet `MessageType` converted from a Catalyst `StructType` containing all requested columns. This is not preferable when taking compatibility and interoperability into consideration, because the actual Parquet file may have a different physical structure from the converted schema.
      
         In this PR, the schema for requested columns is constructed using the following method:
      
         - For a column that exists in the target Parquet file, we extract the column type by name from the full file schema, and construct a single-field `MessageType` for that column.
         - For a column that doesn't exist in the target Parquet file, we create a single-field `StructType` and convert it to a `MessageType` using `CatalystSchemaConverter`.
         - Unions all single-field `MessageType`s into a full schema containing all requested fields
      
         With this change, we also fix [SPARK-6123] [3] by validating the global schema against each individual Parquet part-files.
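
      A hypothetical sketch of the updater pattern from change 1 above; the real converters live in Spark SQL's Parquet support and carry far more detail, and a plain `Array[Any]` stands in for `SpecificMutableRow` here.

      ```scala
      // A converter pushes each finished value to its parent through an updater,
      // instead of knowing what kind of container the parent maintains.
      trait ParentContainerUpdater {
        def set(value: Any): Unit
      }

      // Parent is a mutable row: the child converter writes into one ordinal.
      class RowUpdater(row: Array[Any], ordinal: Int) extends ParentContainerUpdater {
        override def set(value: Any): Unit = row(ordinal) = value
      }

      // The root converter has no parent, so its updater does nothing.
      object NoopUpdater extends ParentContainerUpdater {
        override def set(value: Any): Unit = ()
      }
      ```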
      
      ### Testing
      
      This PR also adds compatibility tests for parquet-avro, parquet-thrift, and parquet-hive. Please refer to `README.md` under `sql/core/src/test` for more information about these tests. To avoid build time code generation and adding extra complexity to the build system, Java code generated from testing Thrift schema and Avro IDL is also checked in.
      
      [1]: https://github.com/apache/incubator-parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules-1
      [2]: https://issues.apache.org/jira/browse/SPARK-6774
      [3]: https://issues.apache.org/jira/browse/SPARK-6123
      [4]: https://issues.apache.org/jira/browse/SPARK-8848
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #7231 from liancheng/spark-6776 and squashes the following commits:
      
      360fe18 [Cheng Lian] Adds ParquetHiveCompatibilitySuite
      c6fbc06 [Cheng Lian] Removes WIP file committed by mistake
      b8c1295 [Cheng Lian] Excludes the whole parquet package from MiMa
      598c3e8 [Cheng Lian] Adds extra Maven repo for hadoop-lzo, which is a transitive dependency of parquet-thrift
      926af87 [Cheng Lian] Simplifies Parquet compatibility test suites
      7946ee1 [Cheng Lian] Fixes Scala styling issues
      3d7ab36 [Cheng Lian] Fixes .rat-excludes
      a8f13bb [Cheng Lian] Using Parquet writer API to do compatibility tests
      f2208cd [Cheng Lian] Adds README.md for Thrift/Avro code generation
      1d390aa [Cheng Lian] Adds parquet-thrift compatibility test
      440f7b3 [Cheng Lian] Adds generated files to .rat-excludes
      13b9121 [Cheng Lian] Adds ParquetAvroCompatibilitySuite
      06cfe9d [Cheng Lian] Adds comments about TimestampType handling
      a099d3e [Cheng Lian] More comments
      0cc1b37 [Cheng Lian] Fixes MiMa checks
      884d3e6 [Cheng Lian] Fixes styling issue and reverts unnecessary changes
      802cbd7 [Cheng Lian] Fixes bugs related to schema merging and empty requested columns
      38fe1e7 [Cheng Lian] Adds explicit return type
      7fb21f1 [Cheng Lian] Reverts an unnecessary debugging change
      1781dff [Cheng Lian] Adds test case for SPARK-8811
      6437d4b [Cheng Lian] Assembles requested schema from Parquet file schema
      bcac49f [Cheng Lian] Removes the 16-byte restriction of decimals
      a74fb2c [Cheng Lian] More comments
      0525346 [Cheng Lian] Removes old Parquet record converters
      03c3bd9 [Cheng Lian] Refactors Parquet read path to implement backwards-compatibility rules
    • [SPARK-8700][ML] Disable feature scaling in Logistic Regression · 57221934
      DB Tsai authored
      All compressed sensing applications, and some of the regression use cases, will get better results with feature scaling turned off. However, if we implement this naively by training on the dataset without any standardization, the rate of convergence will not be good. Instead, we can still standardize the training dataset but penalize each component differently, obtaining effectively the same objective function but a better-conditioned numerical problem. As a result, columns with high variances will be penalized less, and vice versa. Without this, since all the features are standardized, they would all be penalized equally.
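
      One hedged way to make the equivalence concrete (a reading of the description above, not the patch's exact formulation): standardize each feature as $\tilde{x}_j = x_j / \sigma_j$ and fit inner coefficients $w$, so the original-scale coefficients are $\beta_j = w_j / \sigma_j$. Then

      $$\lambda \sum_j \left( \frac{w_j}{\sigma_j} \right)^2 \;=\; \lambda \sum_j \beta_j^2,$$

      so penalizing $w_j$ in proportion to $1/\sigma_j^2$ reproduces the unstandardized objective while keeping the better-conditioned standardized problem, and a column with a large $\sigma_j$ is effectively penalized less, exactly as described.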
      
      In R, there is an option for this.
      `standardize`
      Logical flag for x variable standardization, prior to fitting the model sequence. The coefficients are always returned on the original scale. Default is standardize=TRUE. If variables are in the same units already, you might not wish to standardize. See details below for y standardization with family="gaussian".
      
      +cc holdenk mengxr jkbradley
      
      Author: DB Tsai <dbt@netflix.com>
      
      Closes #7080 from dbtsai/lors and squashes the following commits:
      
      877e6c7 [DB Tsai] rephrase the doc
      7cf45f2 [DB Tsai] address feedback
      78d75c9 [DB Tsai] small change
      c2c9e60 [DB Tsai] style
      6e1a8e0 [DB Tsai] first commit
  17. Jul 03, 2015
  18. Jul 02, 2015
    • [SPARK-8479] [MLLIB] Add numNonzeros and numActives to linalg.Matrices · 34d448db
      MechCoder authored
      Matrices allow explicit zeros to be stored among their values, so a method is handy for checking whether the number of non-zeros is the same as the number of active values.
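
      A minimal sketch of the distinction, using the mllib `Matrices` factory: an explicitly stored zero counts as active but not as a non-zero.

      ```scala
      import org.apache.spark.mllib.linalg.Matrices

      // 2x2 CSC matrix storing one entry per column; one entry is an explicit zero.
      val m = Matrices.sparse(2, 2, Array(0, 1, 2), Array(0, 1), Array(0.0, 3.0))
      m.numActives    // 2 -- entries physically stored
      m.numNonzeros   // 1 -- stored entries that are actually non-zero
      ```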
      
      Author: MechCoder <manojkumarsivaraj334@gmail.com>
      
      Closes #6904 from MechCoder/nnz_matrix and squashes the following commits:
      
      252c6b7 [MechCoder] Add to MiMa excludes
      e2390f5 [MechCoder] Use count instead of foreach
      2f62b2f [MechCoder] Add to MiMa excludes
      d6e96ef [MechCoder] [SPARK-8479] Add numNonzeros and numActives to linalg.Matrices
  19. Jul 01, 2015
    • [SPARK-7820] [BUILD] Fix Java8-tests suite compile and test error under sbt · 9f7db348
      jerryshao authored
      Author: jerryshao <saisai.shao@intel.com>
      
      Closes #7120 from jerryshao/SPARK-7820 and squashes the following commits:
      
      6902439 [jerryshao] fix Java8-tests suite compile error under sbt
    • [SPARK-8378] [STREAMING] Add the Python API for Flume · 75b9fe4c
      zsxwing authored
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #6830 from zsxwing/flume-python and squashes the following commits:
      
      78dfdac [zsxwing] Fix the compile error in the test code
      f1bf3c0 [zsxwing] Address TD's comments
      0449723 [zsxwing] Add sbt goal streaming-flume-assembly/assembly
      e93736b [zsxwing] Fix the test case for determine_modules_to_test
      9d5821e [zsxwing] Fix pyspark_core dependencies
      f9ee681 [zsxwing] Merge branch 'master' into flume-python
      7a55837 [zsxwing] Add streaming_flume_assembly to run-tests.py
      b96b0de [zsxwing] Merge branch 'master' into flume-python
      ce85e83 [zsxwing] Fix incompatible issues for Python 3
      01cbb3d [zsxwing] Add import sys
      152364c [zsxwing] Fix the issue that StringIO doesn't work in Python 3
      14ba0ff [zsxwing] Add flume-assembly for sbt building
      b8d5551 [zsxwing] Merge branch 'master' into flume-python
      4762c34 [zsxwing] Fix the doc
      0336579 [zsxwing] Refactor Flume unit tests and also add tests for Python API
      9f33873 [zsxwing] Add the Python API for Flume
  20. Jun 24, 2015
    • [SPARK-6777] [SQL] Implements backwards compatibility rules in CatalystSchemaConverter · 8ab50765
      Cheng Lian authored
      This PR introduces `CatalystSchemaConverter` for converting Parquet schemas to Spark SQL schemas and vice versa. The original conversion code in `ParquetTypesConverter` is removed. Benefits of the new version are:
      
      1. When converting Spark SQL schemas, it generates standard Parquet schemas conforming to [the most updated Parquet format spec] [1]. Converting to old style Parquet schemas is also supported via feature flag `spark.sql.parquet.followParquetFormatSpec` (which is set to `false` for now, and should be set to `true` after both read and write paths are fixed).
      
         Note that although this version of the Parquet format spec hasn't been officially released yet, Parquet MR 1.7.0 already sticks to it. So it should be safe to follow.
      
      2. It implements the backwards-compatibility rules described in the most updated Parquet format spec, and thus can recognize more schema patterns generated by other/legacy systems/tools.
      3. Code organization follows the convention used in [parquet-mr] [2], which is easier to follow. (The structure of `CatalystSchemaConverter` is similar to `AvroSchemaConverter`.)
      
      To fully implement backwards-compatibility rules in both read and write path, we also need to update `CatalystRowConverter` (which is responsible for converting Parquet records to `Row`s), `RowReadSupport`, and `RowWriteSupport`. These would be done in follow-up PRs.
      
      TODO
      
      - [x] More schema conversion test cases for legacy schema patterns.
      
      [1]: https://github.com/apache/parquet-format/blob/ea095226597fdbecd60c2419d96b54b2fdb4ae6c/LogicalTypes.md
      [2]: https://github.com/apache/parquet-mr/
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #6617 from liancheng/spark-6777 and squashes the following commits:
      
      2a2062d [Cheng Lian] Don't convert decimals without precision information
      b60979b [Cheng Lian] Adds a constructor which accepts a Configuration, and fixes default value of assumeBinaryIsString
      743730f [Cheng Lian] Decimal scale shouldn't be larger than precision
      a104a9e [Cheng Lian] Fixes Scala style issue
      1f71d8d [Cheng Lian] Adds feature flag to allow falling back to old style Parquet schema conversion
      ba84f4b [Cheng Lian] Fixes MapType schema conversion bug
      13cb8d5 [Cheng Lian] Fixes MiMa failure
      81de5b0 [Cheng Lian] Fixes UDT, workaround read path, and add tests
      28ef95b [Cheng Lian] More AnalysisExceptions
      b10c322 [Cheng Lian] Replaces require() with analysisRequire() which throws AnalysisException
      cceaf3f [Cheng Lian] Implements backwards compatibility rules in CatalystSchemaConverter
    • [HOTFIX] [BUILD] Fix MiMa checks in master branch; enable MiMa for launcher project · 13ae806b
      Josh Rosen authored
      This commit changes the MiMa tests to test against the released 1.4.0 artifacts rather than 1.4.0-rc4; this change is necessary to fix a Jenkins build break since it seems that the RC4 snapshot is no longer available via Maven.
      
      I also enabled MiMa checks for the `launcher` subproject, which we should have done right after 1.4.0 was released.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #6974 from JoshRosen/mima-hotfix and squashes the following commits:
      
      4b4175a [Josh Rosen] [HOTFIX] [BUILD] Fix MiMa checks in master branch; enable MiMa for launcher project
  21. Jun 23, 2015
    • [SPARK-7888] Be able to disable intercept in linear regression in ml package · 2b1111dd
      Holden Karau authored
      Author: Holden Karau <holden@pigscanfly.ca>
      
      Closes #6927 from holdenk/SPARK-7888-Be-able-to-disable-intercept-in-Linear-Regression-in-ML-package and squashes the following commits:
      
      0ad384c [Holden Karau] Add MiMa excludes
      4016fac [Holden Karau] Switch to wild card import, remove extra blank lines
      ae5baa8 [Holden Karau] CR feedback, move the fitIntercept down rather than changing ymean and etc above
      f34971c [Holden Karau] Fix some more long lines
      319bd3f [Holden Karau] Fix long lines
      3bb9ee1 [Holden Karau] Update the regression suite tests
      7015b9f [Holden Karau] Our code performs the same with R, except we need more than one data point but that seems reasonable
      0b0c8c0 [Holden Karau] fix the issue with the sample R code
      e2140ba [Holden Karau] Add a test, it fails!
      5e84a0b [Holden Karau] Write out thoughts and use the correct trait
      91ffc0a [Holden Karau] more murh
      006246c [Holden Karau] murp?
  22. Jun 22, 2015
    • [SPARK-8307] [SQL] improve timestamp from parquet · 6b7f2cea
      Davies Liu authored
      This PR changes the code to convert Julian day to Unix timestamp directly (without Calendar and Timestamp).
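
      A hedged sketch of the direct conversion (names are illustrative): Parquet INT96 timestamps carry a Julian day number plus nanoseconds within the day.

      ```scala
      val JulianDayOfUnixEpoch = 2440588L             // Julian day number of 1970-01-01
      val MicrosPerDay = 24L * 60 * 60 * 1000 * 1000  // 86,400,000,000

      def julianToUnixMicros(julianDay: Int, nanosOfDay: Long): Long =
        (julianDay - JulianDayOfUnixEpoch) * MicrosPerDay + nanosOfDay / 1000
      ```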
      
      cc adrian-wang rxin
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #6759 from davies/improve_ts and squashes the following commits:
      
      849e301 [Davies Liu] Merge branch 'master' of github.com:apache/spark into improve_ts
      b0e4cad [Davies Liu] Merge branch 'master' of github.com:apache/spark into improve_ts
      8e2d56f [Davies Liu] address comments
      634b9f5 [Davies Liu] fix mima
      4891efb [Davies Liu] address comment
      bfc437c [Davies Liu] fix build
      ae5979c [Davies Liu] Merge branch 'master' of github.com:apache/spark into improve_ts
      602b969 [Davies Liu] remove jodd
      2f2e48c [Davies Liu] fix test
      8ace611 [Davies Liu] fix mima
      212143b [Davies Liu] fix mima
      c834108 [Davies Liu] Merge branch 'master' of github.com:apache/spark into improve_ts
      a3171b8 [Davies Liu] Merge branch 'master' of github.com:apache/spark into improve_ts
      5233974 [Davies Liu] fix scala style
      361fd62 [Davies Liu] address comments
      ea196d4 [Davies Liu] improve timestamp from parquet
  23. Jun 19, 2015
    • [SPARK-8127] [STREAMING] [KAFKA] KafkaRDD optimize count() take() isEmpty() · 1b6fe9b1
      cody koeninger authored
      Take advantage of offset range info for size-related KafkaRDD methods. Possible fix for [SPARK-7122], but probably a worthwhile optimization regardless.
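
      A hedged sketch of the idea: an offset range already determines how many messages it covers, so size-related methods need not consume any records. The fields match the public `OffsetRange`; declaring it as a case class with these methods is illustrative.

      ```scala
      final case class OffsetRange(topic: String, partition: Int,
                                   fromOffset: Long, untilOffset: Long) {
        def count: Long = untilOffset - fromOffset  // message count without reading Kafka
        def isEmpty: Boolean = count == 0L
      }
      ```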
      
      Author: cody koeninger <cody@koeninger.org>
      
      Closes #6632 from koeninger/kafka-rdd-count and squashes the following commits:
      
      321340d [cody koeninger] [SPARK-8127][Streaming][Kafka] additional test of ordering of take()
      5a05d0f [cody koeninger] [SPARK-8127][Streaming][Kafka] additional test of isEmpty
      f68bd32 [cody koeninger] [Streaming][Kafka][SPARK-8127] code cleanup
      9555b73 [cody koeninger] Merge branch 'master' into kafka-rdd-count
      253031d [cody koeninger] [Streaming][Kafka][SPARK-8127] mima exclusion for change to private method
      8974b9e [cody koeninger] [Streaming][Kafka][SPARK-8127] check offset ranges before constructing KafkaRDD
      c3768c5 [cody koeninger] [Streaming][Kafka] Take advantage of offset range info for size-related KafkaRDD methods.  Possible fix for [SPARK-7122], but probably a worthwhile optimization regardless.
  24. Jun 17, 2015
  25. Jun 16, 2015
    • [SPARK-8126] [BUILD] Make sure temp dir exists when running tests. · cebf2411
      Marcelo Vanzin authored
      If you ran "clean" at the top-level sbt project, the temp dir would
      go away, so running "test" without restarting sbt would fail. This
      fixes that by making sure the temp dir exists before running tests.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #6805 from vanzin/SPARK-8126-fix and squashes the following commits:
      
      12d7768 [Marcelo Vanzin] [SPARK-8126] [build] Make sure temp dir exists when running tests.
  26. Jun 11, 2015
    • [SPARK-8286] Rewrite UTF8String in Java and move it into unsafe package. · 7d669a56
      Reynold Xin authored
      Unit test is still in Scala.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #6738 from rxin/utf8string-java and squashes the following commits:
      
      562dc6e [Reynold Xin] Flag...
      98e600b [Reynold Xin] Another try with encoding setting ..
      cfa6bdf [Reynold Xin] Merge branch 'master' into utf8string-java
      a3b124d [Reynold Xin] Try different UTF-8 encoded characters.
      1ff7c82 [Reynold Xin] Enable UTF-8 encoding.
      82d58cc [Reynold Xin] Reset run-tests.
      2cb3c69 [Reynold Xin] Use utf-8 encoding in set bytes.
      53f8ef4 [Reynold Xin] Hack Jenkins to run one test.
      9a48e8d [Reynold Xin] Fixed runtime compilation error.
      911c450 [Reynold Xin] Moved unit test also to Java.
      4eff7bd [Reynold Xin] Improved unit test coverage.
      8e89a3c [Reynold Xin] Fixed tests.
      77c64bd [Reynold Xin] Fixed string type codegen.
      ffedb62 [Reynold Xin] Code review feedback.
      0967ce6 [Reynold Xin] Fixed import ordering.
      45a123d [Reynold Xin] [SPARK-8286] Rewrite UTF8String in Java and move it into unsafe package.
    • [SPARK-8289] Specify stack size for consistency with Java tests - resolves test failures · 6b68366d
      Adam Roberts authored
      This is a simple change that specifies a stack size of 4096k instead of the vendor default for Java tests (the defaults vary between Java vendors). It remedies test failures observed with JavaALSSuite on IBM and Oracle Java, owing to their lower default stack sizes compared with OpenJDK. 4096k is a suitable default with which the tests pass on every Java vendor tested. The alternative is to reduce the number of iterations in the test (no failures were observed with 5 iterations instead of 15).
      
      -Xss works with Oracle's HotSpot VM, IBM's J9 VM and OpenJDK (IcedTea).
      
      I have ensured this does not have any negative implications for other tests.
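
      A hypothetical sketch of the sbt side of the change; the Maven side would set the equivalent option in its test argLine.

      ```scala
      // Pin the stack size for forked test JVMs so results don't depend on vendor defaults.
      javaOptions in Test += "-Xss4096k"
      ```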
      
      Author: Adam Roberts <aroberts@uk.ibm.com>
      Author: a-roberts <aroberts@uk.ibm.com>
      
      Closes #6727 from a-roberts/IncJavaStackSize and squashes the following commits:
      
      ab40aea [Adam Roberts] Specify stack size for SBT builds
      5032d8d [a-roberts] Update pom.xml
  27. Jun 08, 2015
    • [SPARK-8126] [BUILD] Use custom temp directory during build. · a1d9e5cc
      Marcelo Vanzin authored
      Even with all the efforts to cleanup the temp directories created by
      unit tests, Spark leaves a lot of garbage in /tmp after a test run.
      This change overrides java.io.tmpdir to place those files under the
      build directory instead.
      
      After an sbt full unit test run, I was left with > 400 MB of temp
      files. Since they're now under the build dir, it's much easier to
      clean them up.
      
      Also make a slight change to a unit test to make it not pollute the
      source directory with test data.
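
      A hypothetical sbt sketch of the override described above; the actual build wiring may differ.

      ```scala
      // Redirect java.io.tmpdir for forked test JVMs into the build directory.
      javaOptions in Test += s"-Djava.io.tmpdir=${baseDirectory.value}/target/tmp"
      ```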
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #6674 from vanzin/SPARK-8126 and squashes the following commits:
      
      0f8ad41 [Marcelo Vanzin] Make sure tmp dir exists when tests run.
      643e916 [Marcelo Vanzin] [MINOR] [BUILD] Use custom temp directory during build.
  28. Jun 07, 2015
    • [SPARK-2808] [STREAMING] [KAFKA] cleanup tests from · b127ff8a
      cody koeninger authored
      see if requiring producer acks eliminates the need for waitUntilLeaderOffset calls in tests
      
      Author: cody koeninger <cody@koeninger.org>
      
      Closes #5921 from koeninger/kafka-0.8.2-test-cleanup and squashes the following commits:
      
      1e89dc8 [cody koeninger] Merge branch 'master' into kafka-0.8.2-test-cleanup
      4662828 [cody koeninger] [Streaming][Kafka] filter mima issue for removal of method from private test class
      af1e083 [cody koeninger] Merge branch 'master' into kafka-0.8.2-test-cleanup
      4298ac2 [cody koeninger] [Streaming][Kafka] update comment to trigger jenkins attempt
      1274afb [cody koeninger] [Streaming][Kafka] see if requiring producer acks eliminates the need for waitUntilLeaderOffset calls in tests
  29. Jun 05, 2015
    • Revert "[MINOR] [BUILD] Use custom temp directory during build." · 4036d05c
      Andrew Or authored
      This reverts commit b16b5434.
    • [MINOR] [BUILD] Use custom temp directory during build. · b16b5434
      Marcelo Vanzin authored
      Even with all the efforts to cleanup the temp directories created by
      unit tests, Spark leaves a lot of garbage in /tmp after a test run.
      This change overrides java.io.tmpdir to place those files under the
      build directory instead.
      
      After an sbt full unit test run, I was left with > 400 MB of temp
      files. Since they're now under the build dir, it's much easier to
      clean them up.
      
      Also make a slight change to a unit test to make it not pollute the
      source directory with test data.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #6653 from vanzin/unit-test-tmp and squashes the following commits:
      
      31e2dd5 [Marcelo Vanzin] Fix tests that depend on each other.
      aa92944 [Marcelo Vanzin] [minor] [build] Use custom temp directory during build.
  30. Jun 04, 2015
    • [SPARK-8106] [SQL] Set derby.system.durability=test to speed up Hive compatibility tests · 74dc2a90
      Josh Rosen authored
      Derby has a `derby.system.durability` configuration property that can be used to disable I/O synchronization calls for writes. This sacrifices durability but can result in large performance gains, which is appropriate for tests.
      
      We should enable this in our test system properties in order to speed up the Hive compatibility tests. I saw 2-3x speedups locally with this change.
      
      See https://db.apache.org/derby/docs/10.8/ref/rrefproperdurability.html for more documentation of this property.
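
      A minimal sketch of the wiring, assuming the property is set as a JVM system property before the embedded metastore starts; the exact hook in the build is not shown here.

      ```scala
      // Trade durability for speed in tests: Derby skips I/O sync calls on writes.
      System.setProperty("derby.system.durability", "test")
      ```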
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #6651 from JoshRosen/hive-compat-suite-speedup and squashes the following commits:
      
      b7a08a2 [Josh Rosen] Set derby.system.durability=test in our unit tests.
    • [SPARK-7440][SQL] Remove physical Distinct operator in favor of Aggregate · 2bcdf8c2
      Reynold Xin authored
      This patch replaces Distinct with Aggregate in the optimizer, so Distinct will become
      more efficient over time as we optimize Aggregate (via Tungsten).
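
      A hedged sketch of the rewrite in Catalyst terms (the rule name is an assumption): a Distinct over `child` becomes an Aggregate that groups by all of the child's output columns and returns them unchanged.

      ```scala
      import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, Distinct, LogicalPlan}
      import org.apache.spark.sql.catalyst.rules.Rule

      object ReplaceDistinctWithAggregate extends Rule[LogicalPlan] {
        def apply(plan: LogicalPlan): LogicalPlan = plan transform {
          // SELECT DISTINCT a, b  ==  SELECT a, b ... GROUP BY a, b
          case Distinct(child) => Aggregate(child.output, child.output, child)
        }
      }
      ```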
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #6637 from rxin/replace-distinct and squashes the following commits:
      
      b3cc50e [Reynold Xin] Mima excludes.
      93d6117 [Reynold Xin] Code review feedback.
      87e4741 [Reynold Xin] [SPARK-7440][SQL] Remove physical Distinct operator in favor of Aggregate.