  1. Mar 03, 2015
    • [SPARK-5310][SQL] Fixes to Docs and Datasources API · 54d19689
      Reynold Xin authored
       - Various Fixes to docs
       - Make data source traits actually interfaces
      
      Based on #4862 but with fixed conflicts.
      
      Author: Reynold Xin <rxin@databricks.com>
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #4868 from marmbrus/pr/4862 and squashes the following commits:
      
      fe091ea [Michael Armbrust] Merge remote-tracking branch 'origin/master' into pr/4862
      0208497 [Reynold Xin] Test fixes.
      34e0a28 [Reynold Xin] [SPARK-5310][SQL] Various fixes to Spark SQL docs.
      54d19689
  2. Feb 19, 2015
    • SPARK-4682 [CORE] Consolidate various 'Clock' classes · 34b7c353
      Sean Owen authored
       Another one from JoshRosen's wish list. The first commit is much smaller and removes 2 of the 4 Clock classes. The second is much larger, and necessary for consolidating the streaming one. I put together the implementations in the way that seemed simplest. Almost all of the change is standardizing class and method names.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #4514 from srowen/SPARK-4682 and squashes the following commits:
      
      5ed3a03 [Sean Owen] Javadoc Clock classes; make ManualClock private[spark]
      169dd13 [Sean Owen] Add support for legacy org.apache.spark.streaming clock class names
      277785a [Sean Owen] Reduce the net change in this patch by reversing some unnecessary syntax changes along the way
      b5e53df [Sean Owen] FakeClock -> ManualClock; getTime() -> getTimeMillis()
      160863a [Sean Owen] Consolidate Streaming Clock class into common util Clock
      7c956b2 [Sean Owen] Consolidate Clocks except for Streaming Clock
      34b7c353
  3. Feb 17, 2015
    • [SPARK-5166][SPARK-5247][SPARK-5258][SQL] API Cleanup / Documentation · c74b07fa
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #4642 from marmbrus/docs and squashes the following commits:
      
      d291c34 [Michael Armbrust] python tests
      9be66e3 [Michael Armbrust] comments
      d56afc2 [Michael Armbrust] fix style
      f004747 [Michael Armbrust] fix build
      c4a907b [Michael Armbrust] fix tests
      42e2b73 [Michael Armbrust] [SQL] Documentation / API Clean-up.
      c74b07fa
  4. Feb 09, 2015
    • [SPARK-2996] Implement userClassPathFirst for driver, yarn. · 20a60131
      Marcelo Vanzin authored
      Yarn's config option `spark.yarn.user.classpath.first` does not work the same way as
      `spark.files.userClassPathFirst`; Yarn's version is a lot more dangerous, in that it
      modifies the system classpath, instead of restricting the changes to the user's class
      loader. So this change implements the behavior of the latter for Yarn, and deprecates
      the more dangerous choice.
      
      To be able to achieve feature-parity, I also implemented the option for drivers (the existing
      option only applies to executors). So now there are two options, each controlling whether
      to apply userClassPathFirst to the driver or executors. The old option was deprecated, and
      aliased to the new one (`spark.executor.userClassPathFirst`).
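
       A minimal sketch of setting the two options via `SparkConf`; the keys follow the names in this description, everything else is illustrative:

       ```
       import org.apache.spark.SparkConf

       // Prefer user jars over Spark's classes, separately for driver and executors.
       val conf = new SparkConf()
         .set("spark.driver.userClassPathFirst", "true")
         .set("spark.executor.userClassPathFirst", "true") // replaces spark.files.userClassPathFirst
       ```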
      
      The existing "child-first" class loader also had to be fixed. It didn't handle resources, and it
      was also doing some things that ended up causing JVM errors depending on how things
      were being called.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #3233 from vanzin/SPARK-2996 and squashes the following commits:
      
      9cf9cf1 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
      a1499e2 [Marcelo Vanzin] Remove SPARK_HOME propagation.
      fa7df88 [Marcelo Vanzin] Remove 'test.resource' file, create it dynamically.
      a8c69f1 [Marcelo Vanzin] Review feedback.
      cabf962 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
      a1b8d7e [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
      3f768e3 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
      2ce3c7a [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
      0e6d6be [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
      70d4044 [Marcelo Vanzin] Fix pyspark/yarn-cluster test.
      0fe7777 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
       0e6ef19 [Marcelo Vanzin] Move class loaders around and make names more meaningful.
      fe970a7 [Marcelo Vanzin] Review feedback.
      25d4fed [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
      3cb6498 [Marcelo Vanzin] Call the right loadClass() method on the parent.
      fbb8ab5 [Marcelo Vanzin] Add locking in loadClass() to avoid deadlocks.
      2e6c4b7 [Marcelo Vanzin] Mention new setting in documentation.
      b6497f9 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
      a10f379 [Marcelo Vanzin] Some feedback.
      3730151 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
      f513871 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
      44010b6 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
      7b57cba [Marcelo Vanzin] Remove now outdated message.
      5304d64 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
      35949c8 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
      54e1a98 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
      d1273b2 [Marcelo Vanzin] Add test file to rat exclude.
      fa1aafa [Marcelo Vanzin] Remove write check on user jars.
      89d8072 [Marcelo Vanzin] Cleanups.
      a963ea3 [Marcelo Vanzin] Implement spark.driver.userClassPathFirst for standalone cluster mode.
      50afa5f [Marcelo Vanzin] Fix Yarn executor command line.
      7d14397 [Marcelo Vanzin] Register user jars in executor up front.
      7f8603c [Marcelo Vanzin] Fix yarn-cluster mode without userClassPathFirst.
      20373f5 [Marcelo Vanzin] Fix ClientBaseSuite.
      55c88fa [Marcelo Vanzin] Run all Yarn integration tests via spark-submit.
      0b64d92 [Marcelo Vanzin] Add deprecation warning to yarn option.
      4a84d87 [Marcelo Vanzin] Fix the child-first class loader.
      d0394b8 [Marcelo Vanzin] Add "deprecated configs" to SparkConf.
      46d8cf2 [Marcelo Vanzin] Update doc with new option, change name to "userClassPathFirst".
      a314f2d [Marcelo Vanzin] Enable driver class path isolation in SparkSubmit.
      91f7e54 [Marcelo Vanzin] [yarn] Enable executor class path isolation.
      a853e74 [Marcelo Vanzin] Re-work CoarseGrainedExecutorBackend command line arguments.
      89522ef [Marcelo Vanzin] Add class path isolation support for Yarn cluster mode.
      20a60131
  5. Feb 07, 2015
    • [BUILD] Add the ability to launch spark-shell from SBT. · e9a4fe12
      Michael Armbrust authored
       Now you can quickly launch the spark-shell without building an assembly. For quick development iteration, run `build/sbt ~sparkShell`; calling exit will relaunch the shell with any changes.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #4438 from marmbrus/sparkShellSbt and squashes the following commits:
      
      b4e44fe [Michael Armbrust] [BUILD] Add the ability to launch spark-shell from SBT.
      e9a4fe12
  6. Feb 06, 2015
    • [SQL][HiveConsole][DOC] HiveConsole `correct hiveconsole imports` · b62c3524
      OopsOutOfMemory authored
       Sorry about that: PR #4330 had some mistakes.
       
       This corrects them, so the Hive console works correctly now.
      
      Author: OopsOutOfMemory <victorshengli@126.com>
      
      Closes #4389 from OopsOutOfMemory/doc and squashes the following commits:
      
      843eed9 [OopsOutOfMemory] correct hiveconsole imports
      b62c3524
    • [SPARK-4789] [SPARK-4942] [SPARK-5031] [mllib] Standardize ML Prediction APIs · dc0c4490
      Joseph K. Bradley authored
      This is part (1a) of the updates from the design doc in [https://docs.google.com/document/d/1BH9el33kBX8JiDdgUJXdLW14CA2qhTCWIG46eXZVoJs]
      
      **UPDATE**: Most of the APIs are being kept private[spark] to allow further discussion.  Here is a list of changes which are public:
      * new output columns: rawPrediction, probabilities
        * The “score” column is now called “rawPrediction”
      * Classifiers now provide numClasses
      * Params.get and .set are now protected instead of private[ml].
      * ParamMap now has a size method.
      * new classes: LinearRegression, LinearRegressionModel
      * LogisticRegression now has an intercept.
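
       As a hedged illustration of the output columns in the list above, assuming DataFrames `training` and `test` with "label" and "features" columns (the plumbing around the column names is illustrative, not part of this PR):

       ```
       import org.apache.spark.ml.classification.LogisticRegression

       val lr = new LogisticRegression()
       val model = lr.fit(training)
       model.transform(test)
         .select("rawPrediction", "probability", "prediction")
         .show()
       ```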
      
      ### Sketch of APIs (most of which are private[spark] for now)
      
      Abstract classes for learning algorithms (+ corresponding Model abstractions):
      * Classifier (+ ClassificationModel)
      * ProbabilisticClassifier (+ ProbabilisticClassificationModel)
      * Regressor (+ RegressionModel)
      * Predictor (+ PredictionModel)
      * *For all of these*:
       * There is no strongly typed training-time API.
       * There is a strongly typed test-time (prediction) API which helps developers implement new algorithms.
      
      Concrete classes: learning algorithms
      * LinearRegression
      * LogisticRegression (updated to use new abstract classes)
       * Also, removed "score" in favor of "probability" output column.  Changed BinaryClassificationEvaluator to match. (SPARK-5031)
      
      Other updates:
      * params.scala: Changed Params.set/get to be protected instead of private[ml]
       * This was needed for the example of defining a class from outside of the MLlib namespace.
      * VectorUDT: Will later change from private[spark] to public.
       * This is needed for outside users to write their own validateAndTransformSchema() methods using vectors.
        * Also, added equals() method.
       * SPARK-4942: ML Transformers should allow output cols to be turned on/off
       * Update validateAndTransformSchema
       * Update transform
      * (Updated examples, test suites according to other changes)
      
      New examples:
      * DeveloperApiExample.scala (example of defining algorithm from outside of the MLlib namespace)
       * Added Java version too
      
      Test Suites:
      * LinearRegressionSuite
      * LogisticRegressionSuite
      * + Java versions of above suites
      
      CC: mengxr  etrain  shivaram
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #3637 from jkbradley/ml-api-part1 and squashes the following commits:
      
      405bfb8 [Joseph K. Bradley] Last edits based on code review.  Small cleanups
      fec348a [Joseph K. Bradley] Added JavaDeveloperApiExample.java and fixed other issues: Made developer API private[spark] for now. Added constructors Java can understand to specialized Param types.
      8316d5e [Joseph K. Bradley] fixes after rebasing on master
      fc62406 [Joseph K. Bradley] fixed test suites after last commit
      bcb9549 [Joseph K. Bradley] Fixed issues after rebasing from master (after move from SchemaRDD to DataFrame)
      9872424 [Joseph K. Bradley] fixed JavaLinearRegressionSuite.java Java sql api
      f542997 [Joseph K. Bradley] Added MIMA excludes for VectorUDT (now public), and added DeveloperApi annotation to it
      216d199 [Joseph K. Bradley] fixed after sql datatypes PR got merged
      f549e34 [Joseph K. Bradley] Updates based on code review.  Major ones are: * Created weakly typed Predictor.train() method which is called by fit() so that developers do not have to call schema validation or copy parameters. * Made Predictor.featuresDataType have a default value of VectorUDT.   * NOTE: This could be dangerous since the FeaturesType type parameter cannot have a default value.
      343e7bd [Joseph K. Bradley] added blanket mima exclude for ml package
      82f340b [Joseph K. Bradley] Fixed bug in LogisticRegression (introduced in this PR).  Fixed Java suites
      0a16da9 [Joseph K. Bradley] Fixed Linear/Logistic RegressionSuites
      c3c8da5 [Joseph K. Bradley] small cleanup
      934f97b [Joseph K. Bradley] Fixed bugs from previous commit.
      1c61723 [Joseph K. Bradley] * Made ProbabilisticClassificationModel into a subclass of ClassificationModel.  Also introduced ProbabilisticClassifier.  * This was to support output column “probabilityCol” in transform().
      4e2f711 [Joseph K. Bradley] rat fix
      bc654e1 [Joseph K. Bradley] Added spark.ml LinearRegressionSuite
      8d13233 [Joseph K. Bradley] Added methods: * Classifier: batch predictRaw() * Predictor: train() without paramMap ProbabilisticClassificationModel.predictProbabilities() * Java versions of all above batch methods + others
      1680905 [Joseph K. Bradley] Added JavaLabeledPointSuite.java for spark.ml, and added constructor to LabeledPoint which defaults weight to 1.0
      adbe50a [Joseph K. Bradley] * fixed LinearRegression train() to use embedded paramMap * added Predictor.predict(RDD[Vector]) method * updated Linear/LogisticRegressionSuites
      58802e3 [Joseph K. Bradley] added train() to Predictor subclasses which does not take a ParamMap.
      57d54ab [Joseph K. Bradley] * Changed semantics of Predictor.train() to merge the given paramMap with the embedded paramMap. * remove threshold_internal from logreg * Added Predictor.copy() * Extended LogisticRegressionSuite
      e433872 [Joseph K. Bradley] Updated docs.  Added LabeledPointSuite to spark.ml
      54b7b31 [Joseph K. Bradley] Fixed issue with logreg threshold being set correctly
      0617d61 [Joseph K. Bradley] Fixed bug from last commit (sorting paramMap by parameter names in toString).  Fixed bug in persisting logreg data.  Added threshold_internal to logreg for faster test-time prediction (avoiding map lookup).
      601e792 [Joseph K. Bradley] Modified ParamMap to sort parameters in toString.  Cleaned up classes in class hierarchy, before implementing tests and examples.
      d705e87 [Joseph K. Bradley] Added LinearRegression and Regressor back from ml-api branch
      52f4fde [Joseph K. Bradley] removing everything except for simple class hierarchy for classification
      d35bb5d [Joseph K. Bradley] fixed compilation issues, but have not added tests yet
      bfade12 [Joseph K. Bradley] Added lots of classes for new ML API:
      dc0c4490
  7. Feb 05, 2015
    • [SPARK-5620][DOC] group methods in generated unidoc · 85ccee81
      Xiangrui Meng authored
      It seems that `(ScalaUnidoc, unidoc)` is the correct way to overwrite `scalacOptions` in unidoc.
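
       A hedged sketch of the sbt setting this amounts to (per the sentence above and the commit summary; where it lives in the build is illustrative):

       ```
       scalacOptions in (ScalaUnidoc, unidoc) += "-groups"
       ```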
      
      CC: rxin gzm0
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #4404 from mengxr/SPARK-5620 and squashes the following commits:
      
      f890cf5 [Xiangrui Meng] add -groups to scalacOptions in unidoc
      85ccee81
  8. Feb 04, 2015
    • [SQL][Hiveconsole] Bring hive console code up to date and update README.md · b73d5fff
      OopsOutOfMemory authored
       Add `import org.apache.spark.sql.Dsl._` so that DSL queries work.
       Since `queryExecution` is not available in DataFrame, remove it.
      
      Author: OopsOutOfMemory <victorshengli@126.com>
      Author: Sheng, Li <OopsOutOfMemory@users.noreply.github.com>
      
      Closes #4330 from OopsOutOfMemory/hiveconsole and squashes the following commits:
      
      46eb790 [Sheng, Li] Update SparkBuild.scala
      d23ee9f [OopsOutOfMemory] minor
      d4dd593 [OopsOutOfMemory] refine hive console
      b73d5fff
  9. Feb 03, 2015
    • [SPARK-5536] replace old ALS implementation by the new one · 0cc7b88c
      Xiangrui Meng authored
       The only issue is that `analyzeBlocks` is removed, which was marked as a developer API. I left the other tests in the ALSSuite under `spark.mllib` unchanged, to ensure that the new implementation is correct.
      
      CC: srowen coderxiang
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #4321 from mengxr/SPARK-5536 and squashes the following commits:
      
      5a3cee8 [Xiangrui Meng] update python tests that are too strict
      e840acf [Xiangrui Meng] ignore scala style check for ALS.train
      e9a721c [Xiangrui Meng] update mima excludes
      9ee6a36 [Xiangrui Meng] merge master
      9a8aeac [Xiangrui Meng] update tests
      d8c3271 [Xiangrui Meng] remove analyzeBlocks
      d68eee7 [Xiangrui Meng] add checkpoint to new ALS
      22a56f8 [Xiangrui Meng] wrap old ALS
      c387dff [Xiangrui Meng] support random seed
      3bdf24b [Xiangrui Meng] make storage level configurable in the new ALS
      0cc7b88c
  10. Feb 02, 2015
    • [SPARK-5154] [PySpark] [Streaming] Kafka streaming support in Python · 0561c454
      Davies Liu authored
      This PR brings the Python API for Spark Streaming Kafka data source.
      
      ```
          class KafkaUtils(__builtin__.object)
           |  Static methods defined here:
           |
            |  createStream(ssc, zkQuorum, groupId, topics, storageLevel=StorageLevel(True, True, False, False, 2), keyDecoder=<function utf8_decoder>, valueDecoder=<function utf8_decoder>)
           |      Create an input stream that pulls messages from a Kafka Broker.
           |
           |      :param ssc:  StreamingContext object
           |      :param zkQuorum:  Zookeeper quorum (hostname:port,hostname:port,..).
           |      :param groupId:  The group id for this consumer.
           |      :param topics:  Dict of (topic_name -> numPartitions) to consume.
           |                      Each partition is consumed in its own thread.
           |      :param storageLevel:  RDD storage level.
           |      :param keyDecoder:  A function used to decode key
           |      :param valueDecoder:  A function used to decode value
           |      :return: A DStream object
      ```
       Run the example:
      
      ```
      bin/spark-submit --driver-class-path external/kafka-assembly/target/scala-*/spark-streaming-kafka-assembly-*.jar examples/src/main/python/streaming/kafka_wordcount.py localhost:2181 test
      ```
      
      Author: Davies Liu <davies@databricks.com>
      Author: Tathagata Das <tdas@databricks.com>
      
      Closes #3715 from davies/kafka and squashes the following commits:
      
      d93bfe0 [Davies Liu] Update make-distribution.sh
      4280d04 [Davies Liu] address comments
      e6d0427 [Davies Liu] Merge branch 'master' of github.com:apache/spark into kafka
      f257071 [Davies Liu] add tests for null in RDD
      23b039a [Davies Liu] address comments
      9af51c4 [Davies Liu] Merge branch 'kafka' of github.com:davies/spark into kafka
      a74da87 [Davies Liu] address comments
      dc1eed0 [Davies Liu] Update kafka_wordcount.py
      31e2317 [Davies Liu] Update kafka_wordcount.py
      370ba61 [Davies Liu] Update kafka.py
      97386b3 [Davies Liu] address comment
      2c567a5 [Davies Liu] update logging and comment
      33730d1 [Davies Liu] Merge branch 'master' of github.com:apache/spark into kafka
      adeeb38 [Davies Liu] Merge pull request #3 from tdas/kafka-python-api
      aea8953 [Tathagata Das] Kafka-assembly for Python API
      eea16a7 [Davies Liu] refactor
      f6ce899 [Davies Liu] add example and fix bugs
      98c8d17 [Davies Liu] fix python style
      5697a01 [Davies Liu] bypass decoder in scala
      048dbe6 [Davies Liu] fix python style
      75d485e [Davies Liu] add mqtt
      07923c4 [Davies Liu] support kafka in Python
      0561c454
    • [SPARK-5540] hide ALS.solveLeastSquares · ef65cf09
      Xiangrui Meng authored
       This method survived code review and has been there since v1.1.0. It exposes jblas types. Let's remove it from the public API; I think no one calls it directly.
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #4318 from mengxr/SPARK-5540 and squashes the following commits:
      
      586ade6 [Xiangrui Meng] hide ALS.solveLeastSquares
      ef65cf09
    • [SPARK-5461] [graphx] Add isCheckpointed, getCheckpointedFiles methods to Graph · 842d0003
      Joseph K. Bradley authored
      Added the 2 methods to Graph and GraphImpl.  Both make calls to the underlying vertex and edge RDDs.
      
      This is needed for another PR (for LDA): [https://github.com/apache/spark/pull/4047]
      
      Notes:
      * getCheckpointedFiles is plural and returns a Seq[String] instead of an Option[String].
       * I attempted to test that the methods returned the correct values after checkpointing. It did not work; I guess checkpointing does not occur quickly enough? I noticed that there are no checkpointing tests for RDDs; is it just hard to test well?
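
       A minimal sketch of the new methods, assuming a SparkContext `sc` and an edge-list file (the file path is hypothetical; method names per the squashed commits below):

       ```
       import org.apache.spark.graphx.GraphLoader

       sc.setCheckpointDir("/tmp/checkpoints")
       val graph = GraphLoader.edgeListFile(sc, "followers.txt")
       graph.checkpoint()
       graph.vertices.count()             // materialize so checkpointing actually happens
       println(graph.isCheckpointed)      // true once vertex and edge RDDs are checkpointed
       println(graph.getCheckpointFiles)  // Seq[String] of checkpoint files
       ```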
      
      CC: rxin
      
      CC: mengxr  (since related to LDA)
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #4253 from jkbradley/graphx-checkpoint and squashes the following commits:
      
      b680148 [Joseph K. Bradley] added class tag to firstParent call in VertexRDDImpl.isCheckpointed, though not needed to compile
      250810e [Joseph K. Bradley] In EdgeRDDImple, VertexRDDImpl, added transient back to partitionsRDD, and made isCheckpointed check firstParent instead of partitionsRDD
      695b7a3 [Joseph K. Bradley] changed partitionsRDD in EdgeRDDImpl, VertexRDDImpl to be non-transient
      cc00767 [Joseph K. Bradley] added overrides for isCheckpointed, getCheckpointFile in EdgeRDDImpl, VertexRDDImpl. The corresponding Graph methods now work.
      188665f [Joseph K. Bradley] improved documentation
      235738c [Joseph K. Bradley] Added isCheckpointed and getCheckpointFiles to Graph, GraphImpl
      842d0003
  11. Jan 28, 2015
    • [SPARK-5430] move treeReduce and treeAggregate from mllib to core · 4ee79c71
      Xiangrui Meng authored
      We have seen many use cases of `treeAggregate`/`treeReduce` outside the ML domain. Maybe it is time to move them to Core. pwendell
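
       A hedged usage sketch after the move, assuming a SparkContext `sc` (`depth` shown with its conventional default of 2):

       ```
       val rdd = sc.parallelize(1 to 1000, numSlices = 100)
       // Combine partial results in a multi-level tree to reduce load on the driver.
       val sum = rdd.treeAggregate(0)(_ + _, _ + _, depth = 2)
       val max = rdd.treeReduce(_ max _)
       ```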
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #4228 from mengxr/SPARK-5430 and squashes the following commits:
      
      20ad40d [Xiangrui Meng] exclude tree* from mima
      e89a43e [Xiangrui Meng] fix compile and update java doc
      3ae1a4b [Xiangrui Meng] add treeReduce/treeAggregate to Python
      6f948c5 [Xiangrui Meng] add treeReduce/treeAggregate to JavaRDDLike
      d600b6c [Xiangrui Meng] move treeReduce and treeAggregate to core
      4ee79c71
    • [SPARK-5415] Bump sbt version to 0.13.7 · 661d3f9f
      Ryan Williams authored
      Author: Ryan Williams <ryan.blake.williams@gmail.com>
      
      Closes #4211 from ryan-williams/sbt0.13.7 and squashes the following commits:
      
       e28476d [Ryan Williams] bump sbt version to 0.13.7
      661d3f9f
  12. Jan 27, 2015
    • [SPARK-5097][SQL] DataFrame · 119f45d6
      Reynold Xin authored
       This pull request redesigns the existing Spark SQL dsl, which already provides data frame-like functionality.
      
      TODOs:
      With the exception of Python support, other tasks can be done in separate, follow-up PRs.
      - [ ] Audit of the API
      - [ ] Documentation
      - [ ] More test cases to cover the new API
      - [x] Python support
      - [ ] Type alias SchemaRDD
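
       A hedged taste of the API being introduced, assuming a `SQLContext` with a registered "people" table (details of the dsl were still under audit at this point):

       ```
       val people = sqlContext.table("people")
       people.filter(people("age") > 21)
         .groupBy("name")
         .count()
         .show()
       ```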
      
      Author: Reynold Xin <rxin@databricks.com>
      Author: Davies Liu <davies@databricks.com>
      
      Closes #4173 from rxin/df1 and squashes the following commits:
      
      0a1a73b [Reynold Xin] Merge branch 'df1' of github.com:rxin/spark into df1
      23b4427 [Reynold Xin] Mima.
      828f70d [Reynold Xin] Merge pull request #7 from davies/df
      257b9e6 [Davies Liu] add repartition
      6bf2b73 [Davies Liu] fix collect with UDT and tests
      e971078 [Reynold Xin] Missing quotes.
      b9306b4 [Reynold Xin] Remove removeColumn/updateColumn for now.
      a728bf2 [Reynold Xin] Example rename.
      e8aa3d3 [Reynold Xin] groupby -> groupBy.
      9662c9e [Davies Liu] improve DataFrame Python API
      4ae51ea [Davies Liu] python API for dataframe
      1e5e454 [Reynold Xin] Fixed a bug with symbol conversion.
      2ca74db [Reynold Xin] Couple minor fixes.
      ea98ea1 [Reynold Xin] Documentation & literal expressions.
      2b22684 [Reynold Xin] Got rid of IntelliJ problems.
      02bbfbc [Reynold Xin] Tightening imports.
      ffbce66 [Reynold Xin] Fixed compilation error.
      59b6d8b [Reynold Xin] Style violation.
      b85edfb [Reynold Xin] ALS.
      8c37f0a [Reynold Xin] Made MLlib and examples compile
      6d53134 [Reynold Xin] Hive module.
      d35efd5 [Reynold Xin] Fixed compilation error.
      ce4a5d2 [Reynold Xin] Fixed test cases in SQL except ParquetIOSuite.
      66d5ef1 [Reynold Xin] SQLContext minor patch.
      c9bcdc0 [Reynold Xin] Checkpoint: SQL module compiles!
      119f45d6
    • [SPARK-5321] Support for transposing local matrices · 91426748
      Burak Yavuz authored
      Support for transposing local matrices added. The `.transpose` function creates a new object re-using the backing array(s) but switches `numRows` and `numCols`. Operations check the flag `.isTransposed` to see whether the indexing in `values` should be modified.
      
      This PR will pave the way for transposing `BlockMatrix`.
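
       A small illustration of those semantics (a hedged sketch against the `spark.mllib` local linear algebra API):

       ```
       import org.apache.spark.mllib.linalg.Matrices

       // Column-major 2x3 matrix; transpose flips numRows/numCols and sets
       // isTransposed instead of copying the backing array.
       val m  = Matrices.dense(2, 3, Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0))
       val mt = m.transpose     // a 3x2 view over the same values
       println(mt(2, 1))        // same element as m(1, 2)
       ```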
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #4109 from brkyvz/SPARK-5321 and squashes the following commits:
      
      87ab83c [Burak Yavuz] fixed scalastyle
      caf4438 [Burak Yavuz] addressed code review v3
      c524770 [Burak Yavuz] address code review comments 2
      77481e8 [Burak Yavuz] fixed MiMa
      f1c1742 [Burak Yavuz] small refactoring
      ccccdec [Burak Yavuz] fixed failed test
      dd45c88 [Burak Yavuz] addressed code review
      a01bd5f [Burak Yavuz] [SPARK-5321] Fixed MiMa issues
      2a63593 [Burak Yavuz] [SPARK-5321] fixed bug causing failed gemm test
      c55f29a [Burak Yavuz] [SPARK-5321] Support for transposing local matrices cleaned up
      c408c05 [Burak Yavuz] [SPARK-5321] Support for transposing local matrices added
      91426748
  13. Jan 23, 2015
    • [SPARK-5315][Streaming] Fix bug where reduceByWindow Java API does not work · e0f7fb7f
      jerryshao authored
       `reduceByWindow` in the Java API is actually not Java compatible; this change makes it Java compatible.
      
       The current solution is to deprecate the old API and add a new one. But since the old API is actually incorrect, is keeping it meaningful, just to stay binary compatible? Even adding a new API still requires a MiMa exclusion. I'm not sure whether to change the API in place, or to deprecate the old API and add a new one; which is the best solution?
      
      Author: jerryshao <saisai.shao@intel.com>
      
      Closes #4104 from jerryshao/SPARK-5315 and squashes the following commits:
      
      5bc8987 [jerryshao] Address the comment
      c7aa1b4 [jerryshao] Deprecate the old one to keep binary compatible
      8e9dc67 [jerryshao] Fix JavaDStream reduceByWindow signature error
      e0f7fb7f
  14. Jan 21, 2015
  15. Jan 20, 2015
    • SPARK-5270 [CORE] Provide isEmpty() function in RDD API · 306ff187
      Sean Owen authored
      Pretty minor, but submitted for consideration -- this would at least help people make this check in the most efficient way I know.
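
       Trivially, per the description (assumes an existing RDD `rdd`):

       ```
       // Cheaper than rdd.count() == 0: it only needs to look for a first
       // element instead of counting everything.
       if (rdd.isEmpty()) println("no data to process")
       ```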
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #4074 from srowen/SPARK-5270 and squashes the following commits:
      
      66885b8 [Sean Owen] Add note that JavaRDDLike should not be implemented by user code
      2e9b490 [Sean Owen] More tests, and Mima-exclude the new isEmpty method in JavaRDDLike
      28395ff [Sean Owen] Add isEmpty to Java, Python
      7dd04b7 [Sean Owen] Add efficient RDD.isEmpty()
      306ff187
  16. Jan 17, 2015
    • [SPARK-5096] Use sbt tasks instead of vals to get hadoop version · 6999910b
      Michael Armbrust authored
       This makes it possible to compile Spark as an external `ProjectRef`, whereas now we throw a `FileNotFoundException`.
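
       A hedged sketch of the shape of the change; the key name and default are illustrative, not the actual build code:

       ```
       // Before: a val, evaluated eagerly at project load, which breaks when
       // Spark is loaded as an external ProjectRef. After: a task, on demand.
       val hadoopVersion = taskKey[String]("Hadoop version used for the build")
       hadoopVersion := sys.props.getOrElse("hadoop.version", "1.0.4")
       ```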
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #3905 from marmbrus/effectivePom and squashes the following commits:
      
      fd63aae [Michael Armbrust] Use sbt tasks instead of vals to get hadoop version.
      6999910b
  17. Jan 16, 2015
  18. Jan 14, 2015
    • [SPARK-4014] Add TaskContext.attemptNumber and deprecate TaskContext.attemptId · 259936be
      Josh Rosen authored
      `TaskContext.attemptId` is misleadingly-named, since it currently returns a taskId, which uniquely identifies a particular task attempt within a particular SparkContext, instead of an attempt number, which conveys how many times a task has been attempted.
      
       This patch deprecates `TaskContext.attemptId` and adds `TaskContext.taskId` and `TaskContext.attemptNumber` fields.  Prior to this change, it was impossible to determine whether a task was being re-attempted (or was a speculative copy), which made it difficult to write unit tests for tasks that fail on early attempts or speculative tasks that complete faster than original tasks.
      
      Earlier versions of the TaskContext docs suggest that `attemptId` behaves like `attemptNumber`, so there's an argument to be made in favor of changing this method's implementation.  Since we've decided against making that change in maintenance branches, I think it's simpler to add better-named methods and retain the old behavior for `attemptId`; if `attemptId` behaved differently in different branches, then this would cause confusing build-breaks when backporting regression tests that rely on the new `attemptId` behavior.
      
      Most of this patch is fairly straightforward, but there is a bit of trickiness related to Mesos tasks: since there's no field in MesosTaskInfo to encode the attemptId, I packed it into the `data` field alongside the task binary.
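
       A hedged sketch of consuming the new field inside a task, assuming `TaskContext.get()` is available on the executor:

       ```
       import org.apache.spark.TaskContext

       val ctx = TaskContext.get()
       if (ctx.attemptNumber() > 0) {
         // This task is a re-attempt (or a speculative copy).
       }
       ```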
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #3849 from JoshRosen/SPARK-4014 and squashes the following commits:
      
      89d03e0 [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-4014
      5cfff05 [Josh Rosen] Introduce wrapper for serializing Mesos task launch data.
      38574d4 [Josh Rosen] attemptId -> taskAttemptId in PairRDDFunctions
      a180b88 [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-4014
      1d43aa6 [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-4014
      eee6a45 [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-4014
      0b10526 [Josh Rosen] Use putInt instead of putLong (silly mistake)
      8c387ce [Josh Rosen] Use local with maxRetries instead of local-cluster.
      cbe4d76 [Josh Rosen] Preserve attemptId behavior and deprecate it:
      b2dffa3 [Josh Rosen] Address some of Reynold's minor comments
      9d8d4d1 [Josh Rosen] Doc typo
      1e7a933 [Josh Rosen] [SPARK-4014] Change TaskContext.attemptId to return attempt number instead of task ID.
      fd515a5 [Josh Rosen] Add failing test for SPARK-4014
      259936be
  19. Jan 13, 2015
    • [SPARK-5123][SQL] Reconcile Java/Scala API for data types. · f9969098
      Reynold Xin authored
      Having two versions of the data type APIs (one for Java, one for Scala) requires downstream libraries to also have two versions of the APIs if the library wants to support both Java and Scala. I took a look at the Scala version of the data type APIs - it can actually work out pretty well for Java out of the box.
      
      As part of the PR, I created a sql.types package and moved all type definitions there. I then removed the Java specific data type API along with a lot of the conversion code.
      
      This subsumes https://github.com/apache/spark/pull/3925
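
       After the consolidation, Scala and Java code import the same definitions; a minimal sketch:

       ```
       import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

       val schema = StructType(Seq(
         StructField("id", IntegerType, nullable = false),
         StructField("name", StringType)))
       ```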
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #3958 from rxin/SPARK-5123-datatype-2 and squashes the following commits:
      
      66505cc [Reynold Xin] [SPARK-5123] Expose only one version of the data type APIs (i.e. remove the Java-specific API).
      f9969098
  20. Jan 10, 2015
    • [SPARK-5032] [graphx] Remove GraphX MIMA exclude for 1.3 · 33132609
      Joseph K. Bradley authored
       Since GraphX is no longer alpha as of 1.2, MimaExcludes should not exclude GraphX for 1.3.
      
      Here are the individual excludes I had to add + the associated commits:
      
      ```
                  // SPARK-4444
                  ProblemFilters.exclude[IncompatibleResultTypeProblem](
                    "org.apache.spark.graphx.EdgeRDD.fromEdges"),
                  ProblemFilters.exclude[MissingMethodProblem]("org.apache.spark.graphx.EdgeRDD.filter"),
                  ProblemFilters.exclude[IncompatibleResultTypeProblem](
                    "org.apache.spark.graphx.impl.EdgeRDDImpl.filter"),
      ```
      [https://github.com/apache/spark/commit/9ac2bb18ede2e9f73c255fa33445af89aaf8a000]
      
      ```
                  // SPARK-3623
                  ProblemFilters.exclude[MissingMethodProblem]("org.apache.spark.graphx.Graph.checkpoint")
      ```
      [https://github.com/apache/spark/commit/e895e0cbecbbec1b412ff21321e57826d2d0a982]
      
      ```
                  // SPARK-4620
                  ProblemFilters.exclude[MissingMethodProblem]("org.apache.spark.graphx.Graph.unpersist"),
      ```
      [https://github.com/apache/spark/commit/8817fc7fe8785d7b11138ca744f22f7e70f1f0a0]
      
      CC: rxin
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #3856 from jkbradley/graphx-mima and squashes the following commits:
      
      1eea2f6 [Joseph K. Bradley] moved cleanup to run-tests
      527ccd9 [Joseph K. Bradley] fixed jenkins script to remove ivy2 cache
      802e252 [Joseph K. Bradley] Removed GraphX MIMA excludes and added line to clear spark from .m2 dir before Jenkins tests.  This may not work yet...
      30f8bb4 [Joseph K. Bradley] added individual mima excludes for graphx
      a3fea42 [Joseph K. Bradley] removed graphx mima exclude for 1.3
      33132609
  21. Jan 02, 2015
    • [SPARK-3325][Streaming] Add a parameter to the method print in class DStream · bd88b718
      Yadong Qi authored
      This PR is a fixed version of the original PR #3237 by watermen and scwf.
      This adds the ability to specify how many elements to print in `DStream.print`.
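
       Usage of the extended method, assuming a DStream `lines` (the no-argument form keeps its default of the first 10 elements):

       ```
       lines.print()     // first 10 elements of each RDD in the stream
       lines.print(25)   // new: first 25 elements
       ```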
      
      Author: Yadong Qi <qiyadong2010@gmail.com>
      Author: q00251598 <qiyadong@huawei.com>
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      Author: wangfei <wangfei1@huawei.com>
      
      Closes #3865 from tdas/print-num and squashes the following commits:
      
      cd34e9e [Tathagata Das] Fix bug
      7c09f16 [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into HEAD
      bb35d1a [Yadong Qi] Update MimaExcludes.scala
      f8098ca [Yadong Qi] Update MimaExcludes.scala
      f6ac3cb [Yadong Qi] Update MimaExcludes.scala
      e4ed897 [Yadong Qi] Update MimaExcludes.scala
      3b9d5cf [wangfei] fix conflicts
      ec8a3af [q00251598] move to  Spark 1.3
      26a70c0 [q00251598] extend the Python DStream's print
      b589a4b [q00251598] add another print function
      bd88b718
  22. Dec 31, 2014
    • SPARK-2757 [BUILD] [STREAMING] Add Mima test for Spark Sink after 1.1.0 is released · 4bb12488
      Sean Owen authored
      Re-enable MiMa for Streaming Flume Sink module, now that 1.1.0 is released, per the JIRA TO-DO. That's pretty much all there is to this.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #3842 from srowen/SPARK-2757 and squashes the following commits:
      
      50ff80e [Sean Owen] Exclude apparent false positive turned up by re-enabling MiMa checks for Streaming Flume Sink
      0e5ba5c [Sean Owen] Re-enable MiMa for Streaming Flume Sink module
      4bb12488
  23. Dec 27, 2014
    • HOTFIX: Slight tweak on previous commit. · 82bf4bee
      Patrick Wendell authored
      Meant to merge this in when committing SPARK-3787.
      82bf4bee
    • [SPARK-3787][BUILD] Assembly jar name is wrong when we build with sbt omitting -Dhadoop.version · de95c57a
      Kousuke Saruta authored
       This PR is another solution to the following problem: when we build with sbt with a hadoop profile but without a property for the hadoop version, like:
       
           sbt/sbt -Phadoop-2.2 assembly
       
       the jar name always uses the default version (1.0.4).
       
       When we build with maven under the same conditions, the default version for each profile is used. For instance, if we build like:
       
           mvn -Phadoop-2.2 package
       
       the jar name uses hadoop2.2.0, the default version for the hadoop-2.2 profile.
      
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #3046 from sarutak/fix-assembly-jarname-2 and squashes the following commits:
      
      41ef90e [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into fix-assembly-jarname-2
      50c8676 [Kousuke Saruta] Merge branch 'fix-assembly-jarname-2' of github.com:sarutak/spark into fix-assembly-jarname-2
       52a1cd2 [Kousuke Saruta] Fixed conflicts
      dd30768 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into fix-assembly-jarname2
      f1c90bb [Kousuke Saruta] Fixed SparkBuild.scala in order to read `hadoop.version` property from pom.xml
      af6b100 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into fix-assembly-jarname
      c81806b [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into fix-assembly-jarname
      ad1f96e [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into fix-assembly-jarname
      b2318eb [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into fix-assembly-jarname
      5fc1259 [Kousuke Saruta] Fixed typo.
      eebbb7d [Kousuke Saruta] Fixed wrong jar name
      de95c57a
  24. Dec 19, 2014
    • [Build] Remove spark-staging-1038 · 8e253ebb
      scwf authored
      Author: scwf <wangfei1@huawei.com>
      
      Closes #3743 from scwf/abc and squashes the following commits:
      
      7d98bc8 [scwf] removing spark-staging-1038
      8e253ebb
  25. Dec 15, 2014
    • SPARK-4814 [CORE] Enable assertions in SBT, Maven tests / AssertionError from Hive's LazyBinaryInteger · 81112e4b
      Sean Owen authored
      This enables assertions for the Maven and SBT build, but overrides the Hive module to not enable assertions.
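
       One plausible shape of the sbt side, as a hedged sketch (the actual build wiring may differ, and the Hive module overrides it):

       ```
       // Pass -ea to forked test JVMs so `assert` statements are checked.
       javaOptions in Test += "-ea"
       ```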
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #3692 from srowen/SPARK-4814 and squashes the following commits:
      
      caca704 [Sean Owen] Disable assertions just for Hive
      f71e783 [Sean Owen] Enable assertions for SBT and Maven build
      81112e4b
  26. Dec 09, 2014
    • SPARK-4338. [YARN] Ditch yarn-alpha. · 912563aa
      Sandy Ryza authored
      Sorry if this is a little premature with 1.2 still not out the door, but it will make other work like SPARK-4136 and SPARK-2089 a lot easier.
      
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #3215 from sryza/sandy-spark-4338 and squashes the following commits:
      
      1c5ac08 [Sandy Ryza] Update building Spark docs and remove unnecessary newline
      9c1421c [Sandy Ryza] SPARK-4338. Ditch yarn-alpha.
      912563aa
  27. Dec 04, 2014
    • [SPARK-4685] Include all spark.ml and spark.mllib packages in JavaDoc's MLlib group · 20bfea4a
      lewuathe authored
       This is #3554 from Lewuathe, except that I put both `spark.ml` and `spark.mllib` in the group `MLlib`.
      
      Closes #3554
      
      jkbradley
      
      Author: lewuathe <lewuathe@me.com>
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #3598 from mengxr/Lewuathe-modify-javadoc-setting and squashes the following commits:
      
      184609a [Xiangrui Meng] merge spark.ml and spark.mllib into the same group in javadoc
      f7535e6 [lewuathe] [SPARK-4685] Update JavaDoc settings to include spark.ml and all spark.mllib subpackages in the right sections
      20bfea4a
  28. Nov 28, 2014
    • [SPARK-4193][BUILD] Disable doclint in Java 8 to prevent build errors. · e464f0ac
      Takuya UESHIN authored
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #3058 from ueshin/issues/SPARK-4193 and squashes the following commits:
      
      e096bb1 [Takuya UESHIN] Add a plugin declaration to pluginManagement.
      6762ec2 [Takuya UESHIN] Fix usage of -Xdoclint javadoc option.
      fdb280a [Takuya UESHIN] Fix Javadoc errors.
      4745f3c [Takuya UESHIN] Merge branch 'master' into issues/SPARK-4193
      923e2f0 [Takuya UESHIN] Use doclint option `-missing` instead of `none`.
      30d6718 [Takuya UESHIN] Fix Javadoc errors.
       b548017 [Takuya UESHIN] Disable doclint in Java 8 to prevent build errors.
      e464f0ac
  29. Nov 26, 2014
    • [SPARK-4614][MLLIB] Slight API changes in Matrix and Matrices · 561d31d2
      Xiangrui Meng authored
       Before we have a full picture of the operators we want to add, it might be safer to hide `Matrix.transposeMultiply` in 1.2.0. Another update we want to make is to `Matrix.randn` and `Matrix.rand`, both of which should take a `Random` implementation; otherwise, it is very likely to produce inconsistent RDDs. I also added some unit tests for the matrix factory methods. All of these APIs are new in 1.2, so there are no incompatible changes.
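
       A hedged sketch of the updated factory methods (signatures per the description above; the seed is arbitrary):

       ```
       import java.util.Random
       import org.apache.spark.mllib.linalg.Matrices

       // An explicit Random makes results reproducible and avoids callers
       // implicitly sharing a default seed.
       val a = Matrices.rand(3, 4, new Random(42L))
       val b = Matrices.randn(3, 4, new Random(42L))
       ```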
      
      brkyvz
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #3468 from mengxr/SPARK-4614 and squashes the following commits:
      
      3b0e4e2 [Xiangrui Meng] add mima excludes
      6bfd8a4 [Xiangrui Meng] hide transposeMultiply; add rng to rand and randn; add unit tests
      561d31d2
  30. Nov 19, 2014
    • Updating GraphX programming guide and documentation · 377b0682
      Joseph E. Gonzalez authored
      This pull request revises the programming guide to reflect changes in the GraphX API as well as the deprecated mapReduceTriplets operator.
      
      Author: Joseph E. Gonzalez <joseph.e.gonzalez@gmail.com>
      
      Closes #3359 from jegonzal/GraphXProgrammingGuide and squashes the following commits:
      
      4421964 [Joseph E. Gonzalez] updating documentation for graphx
      377b0682
    • [SPARK-4429][BUILD] Build for Scala 2.11 using sbt fails. · f9adda9a
      Takuya UESHIN authored
      I tried to build for Scala 2.11 using sbt with the following command:
      
      ```
      $ sbt/sbt -Dscala-2.11 assembly
      ```
      
      but it ends with the following error messages:
      
      ```
      [error] (streaming-kafka/*:update) sbt.ResolveException: unresolved dependency: org.apache.kafka#kafka_2.11;0.8.0: not found
      [error] (catalyst/*:update) sbt.ResolveException: unresolved dependency: org.scalamacros#quasiquotes_2.11;2.0.1: not found
      ```
      
       The reason: if the system property `-Dscala-2.11` is set without a value, `SparkBuild.scala` adds the `scala-2.11` profile, but `sbt-pom-reader` activates the `scala-2.10` profile instead, because the activator `PropertyProfileActivator` used by `sbt-pom-reader` internally checks whether the property value is empty.
       
       If the value is set to a non-empty value, there is no need to add profiles in `SparkBuild.scala`, because `sbt-pom-reader` handles it as expected.
      
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #3342 from ueshin/issues/SPARK-4429 and squashes the following commits:
      
      14d86e8 [Takuya UESHIN] Add a comment.
      4eef52b [Takuya UESHIN] Remove unneeded condition.
      ce98d0f [Takuya UESHIN] Set non-empty value to system property "scala-2.11" if the property exists instead of adding profile.
      f9adda9a
    • [HOT FIX] MiMa tests are broken · 0df02ca4
      Andrew Or authored
      This is blocking #3353 and other patches.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #3371 from andrewor14/mima-hot-fix and squashes the following commits:
      
      842d059 [Andrew Or] Move excludes to the right section
      c4d4f4e [Andrew Or] MIMA hot fix
      0df02ca4
  31. Nov 18, 2014
    • Bumping version to 1.3.0-SNAPSHOT. · 397d3aae
      Marcelo Vanzin authored
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #3277 from vanzin/version-1.3 and squashes the following commits:
      
      7c3c396 [Marcelo Vanzin] Added temp repo to sbt build.
      5f404ff [Marcelo Vanzin] Add another exclusion.
      19457e7 [Marcelo Vanzin] Update old version to 1.2, add temporary 1.2 repo.
      3c8d705 [Marcelo Vanzin] Workaround for MIMA checks.
      e940810 [Marcelo Vanzin] Bumping version to 1.3.0-SNAPSHOT.
      397d3aae
    • [SPARK-4017] show progress bar in console · e34f38ff
      Davies Liu authored
      The progress bar will look like this:
      
      ![1___spark_job__85_250_finished__4_are_running___java_](https://cloud.githubusercontent.com/assets/40902/4854813/a02f44ac-6099-11e4-9060-7c73a73151d6.png)
      
      In the right corner, the numbers are: finished tasks, running tasks, total tasks.
      
      After the stage has finished, it will disappear.
      
       The progress bar is only shown if the logging level is WARN or higher (progress in the title is still shown); it can be turned off via `spark.driver.showConsoleProgress`.
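
       Turning it off, as a minimal sketch (property name as given above):

       ```
       import org.apache.spark.SparkConf

       val conf = new SparkConf().set("spark.driver.showConsoleProgress", "false")
       ```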
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #3029 from davies/progress and squashes the following commits:
      
      95336d5 [Davies Liu] Merge branch 'master' of github.com:apache/spark into progress
       fc49ac8 [Davies Liu] address comments
      2e90f75 [Davies Liu] show multiple stages in same time
      0081bcc [Davies Liu] address comments
      38c42f1 [Davies Liu] fix tests
      ab87958 [Davies Liu] disable progress bar during tests
      30ac852 [Davies Liu] re-implement progress bar
      b3f34e5 [Davies Liu] Merge branch 'master' of github.com:apache/spark into progress
      6fd30ff [Davies Liu] show progress bar if no task finished in 500ms
      e4e7344 [Davies Liu] refactor
      e1f524d [Davies Liu] revert unnecessary change
      a60477c [Davies Liu] Merge branch 'master' of github.com:apache/spark into progress
      5cae3f2 [Davies Liu] fix style
      ea49fe0 [Davies Liu] address comments
      bc53d99 [Davies Liu] refactor
      e6bb189 [Davies Liu] fix logging in sparkshell
       7e7d4e7 [Davies Liu] address comments
      5df26bb [Davies Liu] fix style
      9e42208 [Davies Liu] show progress bar in console and title
      e34f38ff