  1. Aug 10, 2015
    • [SPARK-5155] [PYSPARK] [STREAMING] Mqtt streaming support in Python · 853809e9
      Prabeesh K authored
      This PR is based on #4229, thanks prabeesh.
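
      For reference, a minimal PySpark sketch of the API this adds (the broker URL and topic are placeholders; assumes an existing SparkContext `sc`):

      ```python
      from pyspark.streaming import StreamingContext
      from pyspark.streaming.mqtt import MQTTUtils

      ssc = StreamingContext(sc, 1)  # 1-second batches
      # Subscribe to an MQTT topic; each record is a message payload string
      lines = MQTTUtils.createStream(ssc, "tcp://localhost:1883", "topic")
      counts = lines.flatMap(lambda line: line.split(" ")) \
                    .map(lambda word: (word, 1)) \
                    .reduceByKey(lambda a, b: a + b)
      counts.pprint()
      ssc.start()
      ssc.awaitTermination()
      ```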
      
      Closes #4229
      
      Author: Prabeesh K <prabsmails@gmail.com>
      Author: zsxwing <zsxwing@gmail.com>
      Author: prabs <prabsmails@gmail.com>
      Author: Prabeesh K <prabeesh.k@namshi.com>
      
      Closes #7833 from zsxwing/pr4229 and squashes the following commits:
      
      9570bec [zsxwing] Fix the variable name and check null in finally
      4a9c79e [zsxwing] Fix pom.xml indentation
      abf5f18 [zsxwing] Merge branch 'master' into pr4229
      935615c [zsxwing] Fix the flaky MQTT tests
      47278c5 [zsxwing] Include the project class files
      478f844 [zsxwing] Add unpack
      5f8a1d4 [zsxwing] Make the maven build generate the test jar for Python MQTT tests
      734db99 [zsxwing] Merge branch 'master' into pr4229
      126608a [Prabeesh K] address the comments
      b90b709 [Prabeesh K] Merge pull request #1 from zsxwing/pr4229
      d07f454 [zsxwing] Register StreamingListener before starting StreamingContext; Revert unnecessary changes; fix the python unit test
      a6747cb [Prabeesh K] wait for starting the receiver before publishing data
      87fc677 [Prabeesh K] address the comments:
      97244ec [zsxwing] Make sbt build the assembly test jar for streaming mqtt
      80474d1 [Prabeesh K] fix
      1f0cfe9 [Prabeesh K] python style fix
      e1ee016 [Prabeesh K] scala style fix
      a5a8f9f [Prabeesh K] added Python test
      9767d82 [Prabeesh K] implemented Python-friendly class
      a11968b [Prabeesh K] fixed python style
      795ec27 [Prabeesh K] address comments
      ee387ae [Prabeesh K] Fix assembly jar location of mqtt-assembly
      3f4df12 [Prabeesh K] updated version
      b34c3c1 [prabs] address comments
      3aa7fff [prabs] Added Python streaming mqtt word count example
      b7d42ff [prabs] Mqtt streaming support in Python
    • Fixed AtomicReference<> Example · d2852127
      Mahmoud Lababidi authored
      Author: Mahmoud Lababidi <lababidi@gmail.com>
      
      Closes #8076 from lababidi/master and squashes the following commits:
      
      af4553b [Mahmoud Lababidi] Fixed AtomicReference<> Example
  2. Aug 05, 2015
    • [SPARK-6486] [MLLIB] [PYTHON] Add BlockMatrix to PySpark. · 34dcf101
      Mike Dusenberry authored
      mengxr This adds the `BlockMatrix` to PySpark.  I have the conversions to `IndexedRowMatrix` and `CoordinateMatrix` ready as well, so once PR #7554 is completed (which relies on PR #7746), this PR can be finished.
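
      A minimal sketch of the resulting PySpark API (assumes an active SparkContext `sc`):

      ```python
      from pyspark.mllib.linalg import Matrices
      from pyspark.mllib.linalg.distributed import BlockMatrix

      # Two 3x2 dense blocks stacked vertically form a 6x2 distributed matrix
      blocks = sc.parallelize([
          ((0, 0), Matrices.dense(3, 2, [1, 2, 3, 4, 5, 6])),
          ((1, 0), Matrices.dense(3, 2, [7, 8, 9, 10, 11, 12]))])
      mat = BlockMatrix(blocks, 3, 2)  # rowsPerBlock=3, colsPerBlock=2
      print(mat.numRows(), mat.numCols())  # 6 2
      ```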
      
      Author: Mike Dusenberry <mwdusenb@us.ibm.com>
      
      Closes #7761 from dusenberrymw/SPARK-6486_Add_BlockMatrix_to_PySpark and squashes the following commits:
      
      27195c2 [Mike Dusenberry] Adding one more check to _convert_to_matrix_block_tuple, and a few minor documentation changes.
      ae50883 [Mike Dusenberry] Minor update: BlockMatrix should inherit from DistributedMatrix.
      b8acc1c [Mike Dusenberry] Moving BlockMatrix to pyspark.mllib.linalg.distributed, updating the logic to match that of the other distributed matrices, adding conversions, and adding documentation.
      c014002 [Mike Dusenberry] Using properties for better documentation.
      3bda6ab [Mike Dusenberry] Adding documentation.
      8fb3095 [Mike Dusenberry] Small cleanup.
      e17af2e [Mike Dusenberry] Adding BlockMatrix to PySpark.
    • [SPARK-9601] [DOCS] Fix JavaPairDStream signature for stream-stream and windowed join in streaming guide doc · 1bf608b5
      Namit Katariya authored
      
      Author: Namit Katariya <katariya.namit@gmail.com>
      
      Closes #7935 from namitk/SPARK-9601 and squashes the following commits:
      
      03b5784 [Namit Katariya] [SPARK-9601] Fix signature of JavaPairDStream for stream-stream and windowed join in streaming guide doc
    • Update docs/README.md to put all prereqs together. · f7abd6be
      Reynold Xin authored
      This pull request groups all the prerequisites into a single section.
      
      cc srowen shivaram
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #7951 from rxin/readme-docs and squashes the following commits:
      
      ab7ded0 [Reynold Xin] Updated docs/README.md to put all prereqs together.
  3. Aug 04, 2015
    • [SPARK-6485] [MLLIB] [PYTHON] Add CoordinateMatrix/RowMatrix/IndexedRowMatrix to PySpark. · 571d5b53
      Mike Dusenberry authored
      This PR adds the RowMatrix, IndexedRowMatrix, and CoordinateMatrix distributed matrices to PySpark.  Each distributed matrix class acts as a wrapper around the Scala/Java counterpart by maintaining a reference to the Java object.  New distributed matrices can be created using factory methods added to DistributedMatrices, which creates the Java distributed matrix and then wraps it with the corresponding PySpark class.  This design allows for simple conversion between the various distributed matrices, and lets us re-use the Scala code.  Serialization between Python and Java is implemented using DataFrames as needed for IndexedRowMatrix and CoordinateMatrix for simplicity.  Associated documentation and unit-tests have also been added.  To facilitate code review, this PR implements access to the rows/entries as RDDs, the number of rows & columns, and conversions between the various distributed matrices (not including BlockMatrix), and does not implement the other linear algebra functions of the matrices, although this will be very simple to add now.
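
      A minimal sketch of the three wrappers and a conversion (assumes an active SparkContext `sc`; the data is illustrative):

      ```python
      from pyspark.mllib.linalg.distributed import (
          RowMatrix, IndexedRowMatrix, IndexedRow, CoordinateMatrix, MatrixEntry)

      row_mat = RowMatrix(sc.parallelize([[1, 2, 3], [4, 5, 6]]))
      indexed_mat = IndexedRowMatrix(sc.parallelize(
          [IndexedRow(0, [1, 2, 3]), IndexedRow(1, [4, 5, 6])]))
      coord_mat = CoordinateMatrix(sc.parallelize(
          [MatrixEntry(0, 0, 1.0), MatrixEntry(1, 2, 6.0)]))

      # Conversions run on the Scala side and wrap the returned Java object
      converted = coord_mat.toIndexedRowMatrix().toRowMatrix()
      print(converted.numRows(), converted.numCols())
      ```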
      
      Author: Mike Dusenberry <mwdusenb@us.ibm.com>
      
      Closes #7554 from dusenberrymw/SPARK-6485_Add_CoordinateMatrix_RowMatrix_IndexedMatrix_to_PySpark and squashes the following commits:
      
      bb039cb [Mike Dusenberry] Minor documentation update.
      b887c18 [Mike Dusenberry] Updating the matrix conversion logic again to make it even cleaner.  Now, we allow the 'rows' parameter in the constructors to be either an RDD or the Java matrix object. If 'rows' is an RDD, we create a Java matrix object, wrap it, and then store that.  If 'rows' is a Java matrix object of the correct type, we just wrap and store that directly.  This is only for internal usage, and publicly, we still require 'rows' to be an RDD.  We no longer store the 'rows' RDD, and instead just compute it from the Java object when needed.  The point of this is that when we do matrix conversions, we do the conversion on the Scala/Java side, which returns a Java object, so we should use that directly, but exposing 'java_matrix' parameter in the public API is not ideal. This non-public feature of allowing 'rows' to be a Java matrix object is documented in the '__init__' constructor docstrings, which are not part of the generated public API, and doctests are also included.
      7f0dcb6 [Mike Dusenberry] Updating module docstring.
      cfc1be5 [Mike Dusenberry] Use 'new SQLContext(matrix.rows.sparkContext)' rather than 'SQLContext.getOrCreate', as the latter doesn't guarantee that the SparkContext will be the same as for the matrix.rows data.
      687e345 [Mike Dusenberry] Improving conversion performance.  This adds an optional 'java_matrix' parameter to the constructors, and pulls the conversion logic out into a '_create_from_java' function. Now, if the constructors are given a valid Java distributed matrix object as 'java_matrix', they will store those internally, rather than create a new one on the Scala/Java side.
      3e50b6e [Mike Dusenberry] Moving the distributed matrices to pyspark.mllib.linalg.distributed.
      308f197 [Mike Dusenberry] Using properties for better documentation.
      1633f86 [Mike Dusenberry] Minor documentation cleanup.
      f0c13a7 [Mike Dusenberry] CoordinateMatrix should inherit from DistributedMatrix.
      ffdd724 [Mike Dusenberry] Updating doctests to make documentation cleaner.
      3fd4016 [Mike Dusenberry] Updating docstrings.
      27cd5f6 [Mike Dusenberry] Simplifying input conversions in the constructors for each distributed matrix.
      a409cf5 [Mike Dusenberry] Updating doctests to be less verbose by using lists instead of DenseVectors explicitly.
      d19b0ba [Mike Dusenberry] Updating code and documentation to note that a vector-like object (numpy array, list, etc.) can be used in place of explicit Vector object, and adding conversions when necessary to RowMatrix construction.
      4bd756d [Mike Dusenberry] Adding param documentation to IndexedRow and MatrixEntry.
      c6bded5 [Mike Dusenberry] Move conversion logic from tuples to IndexedRow or MatrixEntry types from within the IndexedRowMatrix and CoordinateMatrix constructors to separate _convert_to_indexed_row and _convert_to_matrix_entry functions.
      329638b [Mike Dusenberry] Moving the Experimental tag to the top of each docstring.
      0be6826 [Mike Dusenberry] Simplifying doctests by removing duplicated rows/entries RDDs within the various tests.
      c0900df [Mike Dusenberry] Adding the colons that were accidentally not inserted.
      4ad6819 [Mike Dusenberry] Documenting the  and  parameters.
      3b854b9 [Mike Dusenberry] Minor updates to documentation.
      10046e8 [Mike Dusenberry] Updating documentation to use class constructors instead of the removed DistributedMatrices factory methods.
      119018d [Mike Dusenberry] Adding static  methods to each of the distributed matrix classes to consolidate conversion logic.
      4d7af86 [Mike Dusenberry] Adding type checks to the constructors.  Although it is slightly verbose, it is better for the user to have a good error message than a cryptic stacktrace.
      93b6a3d [Mike Dusenberry] Pulling the DistributedMatrices Python class out of this pull request.
      f6f3c68 [Mike Dusenberry] Pulling the DistributedMatrices Scala class out of this pull request.
      6a3ecb7 [Mike Dusenberry] Updating pattern matching.
      08f287b [Mike Dusenberry] Slight reformatting of the documentation.
      a245dc0 [Mike Dusenberry] Updating Python doctests for compatibility between Python 2 & 3. Since Python 3 removed the idea of a separate 'long' type, all values that would have been outputted as a 'long' (ex: '4L') will now be treated as an 'int' and outputted as one (ex: '4').  The doctests now explicitly convert to ints so that both Python 2 and 3 will have the same output.  This is fine since the values are all small, and thus can be easily represented as ints.
      4d3a37e [Mike Dusenberry] Reformatting a few long Python doctest lines.
      7e3ca16 [Mike Dusenberry] Fixing long lines.
      f721ead [Mike Dusenberry] Updating documentation for each of the distributed matrices.
      ab0e8b6 [Mike Dusenberry] Updating unit test to be more useful.
      dda2f89 [Mike Dusenberry] Added wrappers for the conversions between the various distributed matrices.  Added logic to be able to access the rows/entries of the distributed matrices, which requires serialization through DataFrames for IndexedRowMatrix and CoordinateMatrix types. Added unit tests.
      0cd7166 [Mike Dusenberry] Implemented the CoordinateMatrix API in PySpark, following the idea of the IndexedRowMatrix API, including using DataFrames for serialization.
      3c369cb [Mike Dusenberry] Updating the architecture a bit to make conversions between the various distributed matrix types easier.  The different distributed matrix classes are now only wrappers around the Java objects, and take the Java object as an argument during construction.  This way, we can call  for example on an , which returns a reference to a Java RowMatrix object, and then construct a PySpark RowMatrix object wrapped around the Java object.  This is analogous to the behavior of PySpark RDDs and DataFrames.  We now delegate creation of the various distributed matrices from scratch in PySpark to the factory methods on .
      4bdd09b [Mike Dusenberry] Implemented the IndexedRowMatrix API in PySpark, following the idea of the RowMatrix API.  Note that for the IndexedRowMatrix, we use DataFrames to serialize the data between Python and Scala/Java, so we accept PySpark RDDs, then convert to a DataFrame, then convert back to RDDs on the Scala/Java side before constructing the IndexedRowMatrix.
      23bf1ec [Mike Dusenberry] Updating documentation to add PySpark RowMatrix. Inserting newline above doctest so that it renders properly in API docs.
      b194623 [Mike Dusenberry] Updating design to have a PySpark RowMatrix simply create and keep a reference to a wrapper over a Java RowMatrix.  Updating DistributedMatrices factory methods to accept numRows and numCols with default values.  Updating PySpark DistributedMatrices factory method to simply create a PySpark RowMatrix. Adding additional doctests for numRows and numCols parameters.
      bc2d220 [Mike Dusenberry] Adding unit tests for RowMatrix methods.
      d7e316f [Mike Dusenberry] Implemented the RowMatrix API in PySpark by doing the following: Added a DistributedMatrices class to contain factory methods for creating the various distributed matrices.  Added a factory method for creating a RowMatrix from an RDD of Vectors.  Added a createRowMatrix function to the PythonMLlibAPI to interface with the factory method.  Added DistributedMatrix, DistributedMatrices, and RowMatrix classes to the pyspark.mllib.linalg api.
  4. Aug 03, 2015
    • [SPARK-9521] [DOCS] Addendum. Require Maven 3.3.3+ in the build · 0afa6fbf
      Sean Owen authored
      Follow on for #7852: Building Spark doc needs to refer to new Maven requirement too
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #7905 from srowen/SPARK-9521.2 and squashes the following commits:
      
      73285df [Sean Owen] Follow on for #7852: Building Spark doc needs to refer to new Maven requirement too
    • Add a prerequisites section for building docs · 7abaaad5
      Shivaram Venkataraman authored
      This puts all the install commands that need to be run in one section instead of being spread over many paragraphs.
      
      cc rxin
      
      Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
      
      Closes #7912 from shivaram/docs-setup-readme and squashes the following commits:
      
      cf7a204 [Shivaram Venkataraman] Add a prerequisites section for building docs
    • [SPARK-9191] [ML] [Doc] Add ml.PCA user guide and code examples · 8ca287eb
      Yanbo Liang authored
      Add ml.PCA user guide document and code examples for Scala/Java/Python.
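
      A minimal PySpark sketch in the spirit of the added examples (assumes an existing `sqlContext`; the data is illustrative):

      ```python
      from pyspark.ml.feature import PCA
      from pyspark.mllib.linalg import Vectors

      df = sqlContext.createDataFrame([
          (Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),),
          (Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),),
          (Vectors.dense([1.0, 1.0, 0.0, 2.0, 3.0]),)], ["features"])
      pca = PCA(k=3, inputCol="features", outputCol="pcaFeatures")
      model = pca.fit(df)
      model.transform(df).select("pcaFeatures").show(truncate=False)
      ```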
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #7522 from yanboliang/ml-pca-md and squashes the following commits:
      
      60dec05 [Yanbo Liang] address comments
      f992abe [Yanbo Liang] Add ml.PCA doc and examples
    • [SPARK-9558] [DOCS] Update docs to follow the increase of memory defaults. · ba1c4e13
      Kousuke Saruta authored
      Now the memory defaults of the master and slave in Standalone mode and of the History Server are 1g, not 512m. So let's update the docs.
      
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #7896 from sarutak/update-doc-for-daemon-memory and squashes the following commits:
      
      a77626c [Kousuke Saruta] Fix docs to follow the update of increase of memory defaults
  5. Aug 02, 2015
    • [SPARK-9535] [SQL] [DOCS] Modify document for codegen. · 536d2adc
      KaiXinXiaoLei authored
      #7142 enabled codegen by default, so let's modify the corresponding documents.
      
      Closes #7142
      
      Author: KaiXinXiaoLei <huleilei1@huawei.com>
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #7863 from sarutak/SPARK-9535 and squashes the following commits:
      
      0884424 [Kousuke Saruta] Removed a line which mentioned about the effect of codegen enabled
      3c11af0 [Kousuke Saruta] Merge branch 'sqlconfig' of https://github.com/KaiXinXiaoLei/spark into SPARK-9535
      4ee531d [KaiXinXiaoLei] delete space
      4cfd11d [KaiXinXiaoLei] change spark.sql.planner.externalSort
      d624cf8 [KaiXinXiaoLei] sql config is wrong
  6. Jul 31, 2015
    • [SPARK-9490] [DOCS] [MLLIB] MLlib evaluation metrics guide example python code uses deprecated print statement · 873ab0f9
      Sean Owen authored
      
      Use print(x) not print x for Python 3 in eval examples
      CC sethah mengxr -- just wanted to close this out before 1.5
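
      For illustration:

      ```python
      x = 42
      # print x   # Python 2 only; a SyntaxError under Python 3
      print(x)    # works under both Python 2 and Python 3
      ```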
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #7822 from srowen/SPARK-9490 and squashes the following commits:
      
      01abeba [Sean Owen] Change "print x" to "print(x)" in the rest of the docs too
      bd7f7fb [Sean Owen] Use print(x) not print x for Python 3 in eval examples
    • [SPARK-9202] capping maximum number of executor&driver information kept in Worker · c0686668
      CodingCat authored
      https://issues.apache.org/jira/browse/SPARK-9202
      
      Author: CodingCat <zhunansjtu@gmail.com>
      
      Closes #7714 from CodingCat/SPARK-9202 and squashes the following commits:
      
      23977fb [CodingCat] add comments about why we don't synchronize finishedExecutors & finishedDrivers
      dc9772d [CodingCat] addressing the comments
      e125241 [CodingCat] stylistic fix
      80bfe52 [CodingCat] fix JsonProtocolSuite
      d7d9485 [CodingCat] stylistic fix and respect insert ordering
      031755f [CodingCat] add license info & stylistic fix
      c3b5361 [CodingCat] test cases and docs
      c557b3a [CodingCat] applications are fine
      9cac751 [CodingCat] application is fine...
      ad87ed7 [CodingCat] trimFinishedExecutorsAndDrivers
    • [SPARK-8564] [STREAMING] Add the Python API for Kinesis · 3afc1de8
      zsxwing authored
      This PR adds the Python API for Kinesis, including a Python example and a simple unit test.
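
      A minimal sketch of the new Python API (assumes an existing StreamingContext `ssc`; the application name, stream name, endpoint, and region are placeholders):

      ```python
      from pyspark.storagelevel import StorageLevel
      from pyspark.streaming.kinesis import KinesisUtils, InitialPositionInStream

      stream = KinesisUtils.createStream(
          ssc, "myKinesisApp", "myStream",
          "https://kinesis.us-east-1.amazonaws.com", "us-east-1",
          InitialPositionInStream.LATEST, 2,  # checkpoint interval in seconds
          StorageLevel.MEMORY_AND_DISK_2)
      ```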
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #6955 from zsxwing/kinesis-python and squashes the following commits:
      
      e42e471 [zsxwing] Merge branch 'master' into kinesis-python
      455f7ea [zsxwing] Remove streaming_kinesis_asl_assembly module and simply add the source folder to streaming_kinesis_asl module
      32e6451 [zsxwing] Merge remote-tracking branch 'origin/master' into kinesis-python
      5082d28 [zsxwing] Fix the syntax error for Python 2.6
      fca416b [zsxwing] Fix wrong comparison
      96670ff [zsxwing] Fix the compilation error after merging master
      756a128 [zsxwing] Merge branch 'master' into kinesis-python
      6c37395 [zsxwing] Print stack trace for debug
      7c5cfb0 [zsxwing] RUN_KINESIS_TESTS -> ENABLE_KINESIS_TESTS
      cc9d071 [zsxwing] Fix the python test errors
      466b425 [zsxwing] Add python tests for Kinesis
      e33d505 [zsxwing] Merge remote-tracking branch 'origin/master' into kinesis-python
      3da2601 [zsxwing] Fix the kinesis folder
      687446b [zsxwing] Fix the error message and the maven output path
      add2beb [zsxwing] Merge branch 'master' into kinesis-python
      4957c0b [zsxwing] Add the Python API for Kinesis
  7. Jul 29, 2015
    • [SPARK-6129] [MLLIB] [DOCS] Added user guide for evaluation metrics · 2a9fe4a4
      sethah authored
      Author: sethah <seth.hendrickson16@gmail.com>
      
      Closes #7655 from sethah/Working_on_6129 and squashes the following commits:
      
      253db2d [sethah] removed number formatting from example code
      b769cab [sethah] rewording threshold section
      d5dad4d [sethah] adding some explanations of concepts to the eval metrics user guide
      3a61ff9 [sethah] Removing unnecessary latex commands from metrics guide
      c9dd058 [sethah] Cleaning up and formatting metrics user guide section
      6f31c21 [sethah] All example code for metrics section done
      98813fe [sethah] Most java and python example code added. Further latex formatting
      53a24fc [sethah] Adding documentations of metrics for ML algorithms to user guide
  8. Jul 27, 2015
    • Pregel example type fix · 90006f3c
      Alexander Ulanov authored
      The Pregel example expressing single-source shortest path from https://spark.apache.org/docs/latest/graphx-programming-guide.html#pregel-api does not work due to an incorrect type. The reason is that `GraphGenerators.logNormalGraph` returns a graph with `Long` vertices. Fixing `val graph: Graph[Int, Double]` to `val graph: Graph[Long, Double]`.
      
      Author: Alexander Ulanov <nashb@yandex.ru>
      
      Closes #7695 from avulanov/SPARK-9380-pregel-doc and squashes the following commits:
      
      c269429 [Alexander Ulanov] Pregel example type fix
    • [SPARK-8405] [DOC] Add how to view logs on Web UI when yarn log aggregation is enabled · 62283816
      Carson Wang authored
      Some users may not be aware that the logs are available on the Web UI even if YARN log aggregation is enabled. Update the doc to make this clear and to explain what needs to be configured.
      
      Author: Carson Wang <carson.wang@intel.com>
      
      Closes #7463 from carsonwang/YarnLogDoc and squashes the following commits:
      
      274c054 [Carson Wang] Minor text fix
      74df3a1 [Carson Wang] address comments
      5a95046 [Carson Wang] Update the text in the doc
      e5775c1 [Carson Wang] Update doc about how to view the logs on Web UI when yarn log aggregation is enabled
  9. Jul 23, 2015
    • [SPARK-9207] [SQL] Enables Parquet filter push-down by default · bebe3f7b
      Cheng Lian authored
      PARQUET-136 and PARQUET-173 have been fixed in parquet-mr 1.7.0. It's time to enable filter push-down by default now.
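
      Users who need the old behavior can still opt out explicitly; a sketch (assumes an existing `sqlContext`):

      ```python
      # spark.sql.parquet.filterPushdown now defaults to true; to disable it:
      sqlContext.setConf("spark.sql.parquet.filterPushdown", "false")
      ```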
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #7612 from liancheng/spark-9207 and squashes the following commits:
      
      77e6b5e [Cheng Lian] Enables Parquet filter push-down by default
  10. Jul 22, 2015
    • [SPARK-9144] Remove DAGScheduler.runLocallyWithinThread and spark.localExecution.enabled · b217230f
      Josh Rosen authored
      Spark has an option called spark.localExecution.enabled; according to the docs:
      
      > Enables Spark to run certain jobs, such as first() or take() on the driver, without sending tasks to the cluster. This can make certain jobs execute very quickly, but may require shipping a whole partition of data to the driver.
      
      This feature ends up adding quite a bit of complexity to DAGScheduler, especially in the runLocallyWithinThread method, but as far as I know nobody uses this feature (I searched the mailing list and haven't seen any recent mentions of the configuration nor stacktraces including the runLocally method). As a step towards scheduler complexity reduction, I propose that we remove this feature and all code related to it for Spark 1.5.
      
      This pull request simply brings #7484 up to date.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #7585 from rxin/remove-local-exec and squashes the following commits:
      
      84bd10e [Reynold Xin] Python fix.
      1d9739a [Reynold Xin] Merge pull request #7484 from JoshRosen/remove-localexecution
      eec39fa [Josh Rosen] Remove allowLocal(); deprecate user-facing uses of it.
      b0835dc [Josh Rosen] Remove local execution code in DAGScheduler
      8975d96 [Josh Rosen] Remove local execution tests.
      ffa8c9b [Josh Rosen] Remove documentation for configuration
    • [SPARK-9244] Increase some memory defaults · fe26584a
      Matei Zaharia authored
      There are a few memory limits that people hit often and that we could
      make higher, especially now that memory sizes have grown.
      
      - spark.akka.frameSize: This defaults at 10 but is often hit for map
        output statuses in large shuffles. This memory is not fully allocated
        up-front, so we can just make this larger and still not affect jobs
        that never sent a status that large. We increase it to 128.
      
      - spark.executor.memory: Defaults at 512m, which is really small. We
        increase it to 1g.
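
      Setting these explicitly reproduces the new defaults; a sketch:

      ```python
      from pyspark import SparkConf

      conf = (SparkConf()
              .set("spark.executor.memory", "1g")    # new default, was 512m
              .set("spark.akka.frameSize", "128"))   # new default in MB, was 10
      ```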
      
      Author: Matei Zaharia <matei@databricks.com>
      
      Closes #7586 from mateiz/configs and squashes the following commits:
      
      ce0038a [Matei Zaharia] [SPARK-9244] Increase some memory defaults
  11. Jul 21, 2015
    • [SPARK-5989] [MLLIB] Model save/load for LDA · 89db3c0b
      MechCoder authored
      Add support for saving and loading LDA both the local and distributed versions.
      
      Author: MechCoder <manojkumarsivaraj334@gmail.com>
      
      Closes #6948 from MechCoder/lda_save_load and squashes the following commits:
      
      49bcdce [MechCoder] minor style fixes
      cc14054 [MechCoder] minor
      4587d1d [MechCoder] Minor changes
      c753122 [MechCoder] Load and save the model in private methods
      2782326 [MechCoder] [SPARK-5989] Model save/load for LDA
    • [SPARK-8401] [BUILD] Scala version switching build enhancements · f5b6dc5e
      Michael Allman authored
      These commits address a few minor issues in the Scala cross-version support in the build:
      
        1. Correct two missing `${scala.binary.version}` pom file substitutions.
        2. Don't update `scala.binary.version` in parent POM. This property is set through profiles.
        3. Update the source of the generated scaladocs in `docs/_plugins/copy_api_dirs.rb`.
        4. Factor common code out of `dev/change-version-to-*.sh` and add some validation. We also test `sed` to see if it's GNU sed and try `gsed` as an alternative if not. This prevents the script from running with a non-GNU sed.
      
      This is my original work and I license this work to the Spark project under the Apache License.
      
      Author: Michael Allman <michael@videoamp.com>
      
      Closes #6832 from mallman/scala-versions and squashes the following commits:
      
      cde2f17 [Michael Allman] Delete dev/change-version-to-*.sh, replacing them with single dev/change-scala-version.sh script that takes a version as argument
      02296f2 [Michael Allman] Make the scala version change scripts cross-platform by restricting ourselves to POSIX sed syntax instead of looking for GNU sed
      ad9b40a [Michael Allman] Factor change-scala-version.sh out of change-version-to-*.sh, adding command line argument validation and testing for GNU sed
      bdd20bf [Michael Allman] Update source of scaladocs when changing Scala version
      475088e [Michael Allman] Replace jackson-module-scala_2.10 with jackson-module-scala_${scala.binary.version}
  12. Jul 16, 2015
    • [SPARK-6284] [MESOS] Add mesos role, principal and secret · d86bbb4e
      Timothy Chen authored
      Mesos supports setting a role and authentication credentials (principal and secret) per framework. The role identifies the framework and affects its sharing weight in resource allocation, while the optional authentication information allows the framework to connect to the master.
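
      A sketch of the new settings (the values are placeholders):

      ```python
      from pyspark import SparkConf

      conf = (SparkConf()
              .set("spark.mesos.role", "dev")            # role used for resource-sharing weight
              .set("spark.mesos.principal", "spark")     # principal for framework authentication
              .set("spark.mesos.secret", "my-secret"))   # secret paired with the principal
      ```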
      
      Author: Timothy Chen <tnachen@gmail.com>
      
      Closes #4960 from tnachen/mesos_fw_auth and squashes the following commits:
      
      0f9f03e [Timothy Chen] Fix review comments.
      8f9488a [Timothy Chen] Fix rebase
      f7fc2a9 [Timothy Chen] Add mesos role, auth and secret.
  13. Jul 15, 2015
    • [SPARK-7555] [DOCS] Add doc for elastic net in ml-guide and mllib-guide · 303c1201
      Shuo Xiang authored
      jkbradley I put the elastic net under the **Algorithm guide** section. Also add the formula of elastic net in mllib-linear `mllib-linear-methods#regularizers`.
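
      For reference, the elastic net regularizer added to the guide combines the L1 and L2 penalties; up to notation:

      ```latex
      R(\mathbf{w}) = \alpha \lambda \|\mathbf{w}\|_1
                    + (1 - \alpha) \frac{\lambda}{2} \|\mathbf{w}\|_2^2,
      \qquad \alpha \in [0, 1],\ \lambda \ge 0
      ```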
      
      dbtsai I left the code tab for you to add example code. Do you think it is the right place?
      
      Author: Shuo Xiang <shuoxiangpub@gmail.com>
      
      Closes #6504 from coderxiang/elasticnet and squashes the following commits:
      
      f6061ee [Shuo Xiang] typo
      90a7c88 [Shuo Xiang] Merge remote-tracking branch 'upstream/master' into elasticnet
      0610a36 [Shuo Xiang] move out the elastic net to ml-linear-methods
      8747190 [Shuo Xiang] merge master
      706d3f7 [Shuo Xiang] add python code
      9bc2b4c [Shuo Xiang] typo
      db32a60 [Shuo Xiang] java code sample
      aab3b3a [Shuo Xiang] Merge remote-tracking branch 'upstream/master' into elasticnet
      a0dae07 [Shuo Xiang] simplify code
      d8616fd [Shuo Xiang] Update the definition of elastic net. Add scala code; Mention Lasso and Ridge
      df5bd14 [Shuo Xiang] use wikipeida page in ml-linear-methods.md
      78d9366 [Shuo Xiang] address comments
      8ce37c2 [Shuo Xiang] Merge branch 'elasticnet' of github.com:coderxiang/spark into elasticnet
      8f24848 [Shuo Xiang] Merge branch 'elastic-net-doc' of github.com:coderxiang/spark into elastic-net-doc
      998d766 [Shuo Xiang] Merge branch 'elastic-net-doc' of github.com:coderxiang/spark into elastic-net-doc
      89f10e4 [Shuo Xiang] Merge remote-tracking branch 'upstream/master' into elastic-net-doc
      9262a72 [Shuo Xiang] update
      7e07d12 [Shuo Xiang] update
      b32f21a [Shuo Xiang] add doc for elastic net in sparkml
      937eef1 [Shuo Xiang] Merge remote-tracking branch 'upstream/master' into elastic-net-doc
      180b496 [Shuo Xiang] Merge remote-tracking branch 'upstream/master'
      aa0717d [Shuo Xiang] Merge remote-tracking branch 'upstream/master'
      5f109b4 [Shuo Xiang] Merge remote-tracking branch 'upstream/master'
      c5c5bfe [Shuo Xiang] Merge remote-tracking branch 'upstream/master'
      98804c9 [Shuo Xiang] fix bug in topBykey and update test
    • [SPARK-8018] [MLLIB] KMeans should accept initial cluster centers as param · 3f6296fe
      FlytxtRnD authored
      This allows KMeans to be initialized using an existing set of cluster centers provided as a KMeansModel object. This mode of initialization performs a single run.
      
      Author: FlytxtRnD <meethu.mathew@flytxt.com>
      
      Closes #6737 from FlytxtRnD/Kmeans-8018 and squashes the following commits:
      
      94b56df [FlytxtRnD] style correction
      ef95ee2 [FlytxtRnD] style correction
      c446c58 [FlytxtRnD] documentation and numRuns warning change
      06d13ef [FlytxtRnD] numRuns corrected
      d12336e [FlytxtRnD] numRuns variable modifications
      07f8554 [FlytxtRnD] remove setRuns from setInitialModel
      e721dfe [FlytxtRnD] Merge remote-tracking branch 'upstream/master' into Kmeans-8018
      242ead1 [FlytxtRnD] corrected == to === in assert
      714acb5 [FlytxtRnD] added numRuns
      60c8ce2 [FlytxtRnD] ignore runs parameter and initialModel test suite changed
      582e6d9 [FlytxtRnD] Merge remote-tracking branch 'upstream/master' into Kmeans-8018
      3f5fc8e [FlytxtRnD] test case modified and one runs condition added
      cd5dc5c [FlytxtRnD] Merge remote-tracking branch 'upstream/master' into Kmeans-8018
      16f1b53 [FlytxtRnD] Merge branch 'Kmeans-8018', remote-tracking branch 'upstream/master' into Kmeans-8018
      e9c35d7 [FlytxtRnD] Remove getInitialModel and match cluster count criteria
      6959861 [FlytxtRnD] Accept initial cluster centers in KMeans
  14. Jul 14, 2015
    • [SPARK-9010] [DOCUMENTATION] Improve the Spark Configuration document about `spark.kryoserializer.buffer` · c1feebd8
      zhaishidan authored
      
      The meaning of spark.kryoserializer.buffer should be "Initial size of Kryo's serialization buffer. Note that there will be one buffer per core on each worker. This buffer will grow up to spark.kryoserializer.buffer.max if needed.".
      
      `spark.kryoserializer.buffer.max.mb` is out of date as of Spark 1.4.
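
      A sketch of the two related settings (the values shown are the documented defaults):

      ```python
      from pyspark import SparkConf

      conf = (SparkConf()
              .set("spark.kryoserializer.buffer", "64k")      # initial size, one buffer per core
              .set("spark.kryoserializer.buffer.max", "64m")) # ceiling the buffer may grow to
      ```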
      
      Author: zhaishidan <zhaishidan@haizhi.com>
      
      Closes #7393 from stanzhai/master and squashes the following commits:
      
      69729ef [zhaishidan] fix document error about spark.kryoserializer.buffer.max.mb
  15. Jul 10, 2015
    • [SPARK-8598] [MLLIB] Implementation of 1-sample, two-sided, Kolmogorov Smirnov Test for RDDs · 9c507577
      jose.cambronero authored
      This contribution is my original work and I license it to the project under its open source license.
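
      A minimal sketch of the corresponding call from PySpark (assuming the Python wrapper that accompanied this work; the data is illustrative and `sc` is an active SparkContext):

      ```python
      from pyspark.mllib.stat import Statistics

      data = sc.parallelize([0.1, 0.15, 0.2, 0.3, 0.25])
      # One-sample, two-sided KS test against a standard normal N(0, 1)
      result = Statistics.kolmogorovSmirnovTest(data, "norm", 0.0, 1.0)
      print(result)  # statistic, p-value, and null-hypothesis summary
      ```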
      
      Author: jose.cambronero <jose.cambronero@cloudera.com>
      
      Closes #6994 from josepablocam/master and squashes the following commits:
      
      bbb30b1 [jose.cambronero] renamed KSTestResult to KolmogorovSmirnovTestResult, to stay consistent with method name
      0d0c201 [jose.cambronero] kstTest -> kolmogorovSmirnovTest in statistics.md
      1f56371 [jose.cambronero] changed ksTest in public API to kolmogorovSmirnovTest for clarity
      a48ae7b [jose.cambronero] refactor code to account for serializable RealDistribution. Reuse testOneSample( _, cdf)
      1bb44bd [jose.cambronero]  style and doc changes. Factored out ks test into 2 separate tests
      2ec2aa6 [jose.cambronero] initialize to stdnormal when no params passed (and log). Change unit tests to approximate equivalence rather than strict
      a4bc0c7 [jose.cambronero] changed ksTest(data, distName) to ksTest(data, distName, params*) after api discussions. Changed tests and docs accordingly
      7e66f57 [jose.cambronero] copied implementation note to public api docs, and added @see for links to wiki info
      e760ebd [jose.cambronero] line length changes to fit style check
      3288e42 [jose.cambronero] addressed style changes, correctness change to simpler approach, and fixed edge case for foldLeft in searchOneSampleCandidates when a partition is empty
      9026895 [jose.cambronero] addressed style changes, correctness change to simpler approach, and fixed edge case for foldLeft in searchOneSampleCandidates when a partition is empty
      1226b30 [jose.cambronero] reindent multi-line lambdas, prior interpretation of style guide was wrong on my part
      9c0f1af [jose.cambronero] additional style changes incorporated and added documentation to mllib statistics docs
      3f81ad2 [jose.cambronero] renamed ks1 sample test for clarity
      992293b [jose.cambronero] Style changes as per comments and added implementation note explaining the distributed approach.
      6a4784f [jose.cambronero] specified what distributions are available for the convenience method ksTest(data, name) (solely standard normal)
      4b8ba61 [jose.cambronero] fixed off by 1/N in cases when post-constant adjustment ecdf is above cdf, but prior to adj it was below
      0b5e8ec [jose.cambronero] changed KS one sample test to perform just 1 distributed pass (in addition to the sorting pass), operates on each partition separately. Implementation of Sandy Ryza's algorithm
      16b5c4c [jose.cambronero] renamed dat to data and eliminated recalc of RDD size by sharing as argument between empirical and evalOneSampleP
      c18dc66 [jose.cambronero] removed ksTestOpt from API and changed comments in HypothesisTestSuite accordingly
      f6951b6 [jose.cambronero] changed style and some comments based on feedback from pull request
      b9cff3a [jose.cambronero] made small changes to pass style check
      ce8e9a1 [jose.cambronero] added kstest testing in HypothesisTestSuite
      4da189b [jose.cambronero] added user facing ks test functions
      c659ea1 [jose.cambronero] created KS test class
      13dfe4d [jose.cambronero] created test result class for ks test
    • [SPARK-8958] Dynamic allocation: change cached timeout to infinity · 5dd45bde
      Andrew Or authored
      pwendell and I discussed this a little more offline and concluded that it would be good to keep it more conservative. Losing cached blocks may be very expensive and we should only allow it if the user knows what he/she is doing.
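
      Users who prefer the previous behavior can still set a finite timeout; a sketch (the value is illustrative):

      ```python
      from pyspark import SparkConf

      conf = SparkConf().set(
          "spark.dynamicAllocation.cachedExecutorIdleTimeout", "10m")
      ```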
      
      FYI harishreedharan sryza.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #7329 from andrewor14/da-cached-timeout and squashes the following commits:
      
      cef0b4e [Andrew Or] Change timeout to infinity
  16. Jul 09, 2015
    • [DOCS] Added important updateStateByKey details · d538919c
      Michael Vogiatzis authored
      The update function runs for *all* existing keys, and returning "None" removes the key-value pair.
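
      A minimal PySpark sketch (assumes `pairs` is a DStream of (word, 1) pairs; the `update` helper is illustrative):

      ```python
      def update(new_values, running_count):
          # Invoked for every key with state, even when new_values is empty
          if not new_values and running_count is not None:
              return None  # returning None removes the key-value pair
          return sum(new_values) + (running_count or 0)

      running_counts = pairs.updateStateByKey(update)
      ```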
      
      Author: Michael Vogiatzis <michaelvogiatzis@gmail.com>
      
      Closes #7229 from mvogiatzis/patch-1 and squashes the following commits:
      
      e7a2946 [Michael Vogiatzis] Updated updateStateByKey text
      00283ed [Michael Vogiatzis] Removed space
      c2656f9 [Michael Vogiatzis] Moved description farther up
      0a42551 [Michael Vogiatzis] Added important updateStateByKey details
  17. Jul 08, 2015
    • [SPARK-8927] [DOCS] Format wrong for some config descriptions · 28fa01e2
      Jonathan Alter authored
      A couple descriptions were not inside `<td></td>` and were being displayed immediately under the section title instead of in their row.
      
      Author: Jonathan Alter <jonalter@users.noreply.github.com>
      
      Closes #7292 from jonalter/docs-config and squashes the following commits:
      
      5ce1570 [Jonathan Alter] [DOCS] Format wrong for some config descriptions
    • [SPARK-8909][Documentation] Change the scala example in sql-programming-guide#Manually Specifying Options to be in sync with java,python, R version · 8f3cd932
      Alok Singh authored
      
      Author: Alok Singh <singhal@us.ibm.com>
      
      Closes #7299 from aloknsingh/aloknsingh_SPARK-8909 and squashes the following commits:
      
      d3c20ba [Alok Singh] fix the file to .parquet from .json
      d476140 [Alok Singh] [SPARK-8909][Documentation] Change the scala example in sql-programming-guide#Manually Specifying Options to be in sync with java,python, R version
    • [SPARK-8457] [ML] NGram Documentation · c5532e2f
      Feynman Liang authored
      Add documentation for NGram feature transformer.
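
      A minimal PySpark sketch of the transformer being documented (assumes an existing `sqlContext`):

      ```python
      from pyspark.ml.feature import NGram

      df = sqlContext.createDataFrame(
          [(0, ["a", "quick", "brown", "fox"])], ["id", "words"])
      ngram = NGram(n=2, inputCol="words", outputCol="ngrams")
      ngram.transform(df).select("ngrams").show(truncate=False)
      ```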
      
      Author: Feynman Liang <fliang@databricks.com>
      
      Closes #7244 from feynmanliang/SPARK-8457 and squashes the following commits:
      
      5aface9 [Feynman Liang] Pretty print Scala output and add API doc to each codetab
      60d5ac0 [Feynman Liang] Inline API doc and fix indentation
      736ccbc [Feynman Liang] NGram feature transformer documentation
    • [SPARK-8900] [SPARKR] Fix sparkPackages in init documentation · 374c8a8a
      Shivaram Venkataraman authored
      cc pwendell
      
      Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
      
      Closes #7293 from shivaram/sparkr-packages-doc and squashes the following commits:
      
      c91471d [Shivaram Venkataraman] Fix sparkPackages in init documentation
    • [SPARK-8894] [SPARKR] [DOC] Example code errors in SparkR documentation. · bf02e377
      Sun Rui authored
      Author: Sun Rui <rui.sun@intel.com>
      
      Closes #7287 from sun-rui/SPARK-8894 and squashes the following commits:
      
      da63898 [Sun Rui] [SPARK-8894][SPARKR][DOC] Example code errors in SparkR documentation.