  1. May 22, 2015
    • [SPARK-7578] [ML] [DOC] User guide for spark.ml Normalizer, IDF, StandardScaler · 2728c3df
      Joseph K. Bradley authored
      Added user guide sections with code examples.
      Also added small Java unit tests to test Java example in guide.
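
      For orientation, here is a minimal plain-Scala sketch of the arithmetic behind the three transformers named in this commit. It is not the code added to the guide and does not use the spark.ml API. The object and method names are made up for illustration, and the smoothed IDF formula and the use of the sample standard deviation are assumptions about the usual definitions.

      ```scala
        object FeatureScalingSketch {
          // Normalizer idea: scale a vector to unit p-norm (p >= 1).
          // An all-zero vector is returned unchanged.
          def normalize(v: Array[Double], p: Double = 2.0): Array[Double] = {
            require(p >= 1.0, "p must be >= 1")
            val norm = math.pow(v.map(x => math.pow(math.abs(x), p)).sum, 1.0 / p)
            if (norm == 0.0) v else v.map(_ / norm)
          }

          // StandardScaler idea: shift one feature column to zero mean and scale it
          // to unit sample standard deviation. Constant or too-short columns map to zeros.
          def standardize(column: Array[Double]): Array[Double] = {
            val n = column.length
            if (n < 2) Array.fill(n)(0.0)
            else {
              val mean = column.sum / n
              val std = math.sqrt(column.map(x => (x - mean) * (x - mean)).sum / (n - 1))
              if (std == 0.0) Array.fill(n)(0.0) else column.map(x => (x - mean) / std)
            }
          }

          // IDF idea (assumed smoothed form): log((numDocs + 1) / (docFreq + 1)).
          def idf(numDocs: Long, docFreq: Long): Double =
            math.log((numDocs + 1.0) / (docFreq + 1.0))
        }
      ```

      A term that appears in every document gets an IDF of zero under this formula, so it contributes nothing after TF-IDF weighting.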
      
      CC: mengxr
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #6127 from jkbradley/feature-guide-2 and squashes the following commits:
      
      cd47f4b [Joseph K. Bradley] Updated based on code review
      f16bcec [Joseph K. Bradley] Fixed merge issues and update Python examples print calls for Python 3
      0a862f9 [Joseph K. Bradley] Added Normalizer, StandardScaler to ml-features doc, plus small Java unit tests
      a21c2d6 [Joseph K. Bradley] Updated ml-features.md with IDF
      2728c3df
  2. May 21, 2015
    • [DOCS] [MLLIB] Fixing broken link in MLlib Linear Methods documentation. · e4136ea6
      Mike Dusenberry authored
      Just a small change: fixed a broken link in the MLlib Linear Methods documentation by removing a newline character between the link title and link address.
      
      Author: Mike Dusenberry <dusenberrymw@gmail.com>
      
      Closes #6340 from dusenberrymw/Fix_MLlib_Linear_Methods_link and squashes the following commits:
      
      0a57818 [Mike Dusenberry] Fixing broken link in MLlib Linear Methods documentation.
      e4136ea6
    • [SPARK-7585] [ML] [DOC] VectorIndexer user guide section · 6d75ed7e
      Joseph K. Bradley authored
      Added VectorIndexer section to ML user guide.  Also added javaCategoryMaps() method and Java unit test for it.
      
      CC: mengxr
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #6255 from jkbradley/vector-indexer-guide and squashes the following commits:
      
      dbb8c4c [Joseph K. Bradley] simplified VectorIndexerModel.javaCategoryMaps
      f692084 [Joseph K. Bradley] Added VectorIndexer section to ML user guide.  Also added javaCategoryMaps() method and Java unit test for it.
      6d75ed7e
    • [SPARK-7752] [MLLIB] Use lowercase letters for NaiveBayes.modelType · 13348e21
      Xiangrui Meng authored
      to be consistent with other string names in MLlib. This PR also updates the implementation to use vals instead of hardcoded strings. jkbradley leahmcguire
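
      The "vals instead of hardcoded strings" pattern could look roughly like the sketch below. This is an illustration only, not the code from this PR: the object name, the `validate` helper, and the exact string values are assumptions.

      ```scala
        object ModelTypeSketch {
          // Named constants replace string literals scattered through the code.
          val Multinomial: String = "multinomial"
          val Bernoulli: String = "bernoulli"

          // Single place that lists every accepted value.
          val supported: Set[String] = Set(Multinomial, Bernoulli)

          // Reject anything that is not one of the lowercase names above.
          def validate(modelType: String): String = {
            require(supported.contains(modelType),
              s"Unknown model type: $modelType. Supported: ${supported.mkString(", ")}")
            modelType
          }
        }
      ```

      With lowercase-only names, a call such as `validate("Multinomial")` fails fast instead of silently selecting a default.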
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #6277 from mengxr/SPARK-7752 and squashes the following commits:
      
      f38b662 [Xiangrui Meng] add another case _ back in test
      ae5c66a [Xiangrui Meng] model type -> modelType
      711d1c6 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7752
      40ae53e [Xiangrui Meng] fix Java test suite
      264a814 [Xiangrui Meng] add case _ back
      3c456a8 [Xiangrui Meng] update NB user guide
      17bba53 [Xiangrui Meng] update naive Bayes to use lowercase model type strings
      13348e21
  3. May 20, 2015
    • [SPARK-7750] [WEBUI] Rename endpoints from `json` to `api` to allow further extension to non-json outputs too. · a70bf06b
      Hari Shreedharan authored
      
      Author: Hari Shreedharan <hshreedharan@apache.org>
      
      Closes #6273 from harishreedharan/json-to-api and squashes the following commits:
      
      e14b73b [Hari Shreedharan] Rename `getJsonServlet` to `getServletHandler` i
      42f8acb [Hari Shreedharan] Import order fixes.
      2ef852f [Hari Shreedharan] [SPARK-7750][WebUI] Rename endpoints from `json` to `api` to allow further extension to non-json outputs too.
      a70bf06b
    • [SPARK-7579] [ML] [DOC] User guide update for OneHotEncoder · 829f1d95
      Sandy Ryza authored
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #6126 from sryza/sandy-spark-7579 and squashes the following commits:
      
      5af803d [Sandy Ryza] SPARK-7579 [MLLIB] User guide update for OneHotEncoder
      829f1d95
    • [SPARK-7533] [YARN] Decrease spacing between AM-RM heartbeats. · 3ddf051e
      ehnalis authored
      Added faster RM-heartbeats on pending container allocations with multiplicative back-off.
      Also updated the related documentation.
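
      To illustrate the multiplicative back-off described above, here is a minimal sketch, not the code from this commit. It assumes the wait between heartbeats starts at a small initial value while container allocations are pending and doubles after each heartbeat until it reaches the regular interval. The object name, method names, and the values 200 ms and 3000 ms are placeholders chosen for illustration.

      ```scala
        object HeartbeatBackoffSketch {
          // Wait before the next heartbeat: double the current wait while allocations
          // are still pending (capped at the regular interval); otherwise fall back
          // to the regular interval.
          def nextIntervalMs(currentMs: Long, pending: Boolean, regularMs: Long): Long =
            if (pending) math.min(currentMs * 2, regularMs) else regularMs

          // Waits used over `beats` consecutive heartbeats while allocations stay pending.
          def pendingSchedule(initialMs: Long, regularMs: Long, beats: Int): List[Long] =
            Iterator
              .iterate(initialMs)(ms => nextIntervalMs(ms, pending = true, regularMs = regularMs))
              .take(beats)
              .toList

          def main(args: Array[String]): Unit =
            println(pendingSchedule(initialMs = 200L, regularMs = 3000L, beats = 6))
        }
      ```

      With these placeholder values the waits are 200, 400, 800, 1600, 3000 and 3000 ms: early heartbeats are frequent, and the rate settles at the regular interval if allocations stay pending.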
      
      Author: ehnalis <zoltan.zvara@gmail.com>
      
      Closes #6082 from ehnalis/yarn and squashes the following commits:
      
      a1d2101 [ehnalis] Misspelling fixed.
      90f8ba4 [ehnalis] Changed default HB values.
      6120295 [ehnalis] Removed the bug, when allocation heartbeat would not start from initial value.
      08bac63 [ehnalis] Refined style, grammar, removed duplicated code.
      073d283 [ehnalis] [SPARK-7533] [YARN] Decrease spacing between AM-RM heartbeats.
      d4408c9 [ehnalis] [SPARK-7533] [YARN] Decrease spacing between AM-RM heartbeats.
      3ddf051e
  4. May 19, 2015
    • [SPARK-7744] [DOCS] [MLLIB] "Distributed matrix" section in MLlib "Data Types" documentation should be reordered. · 38605206
      Mike Dusenberry authored
      
      The documentation for BlockMatrix should come after RowMatrix, IndexedRowMatrix, and CoordinateMatrix, as BlockMatrix references the latter three types, and RowMatrix is considered the "basic" distributed matrix.  This will improve comprehensibility of the "Distributed matrix" section, especially for the new reader.
      
      Author: Mike Dusenberry <dusenberrymw@gmail.com>
      
      Closes #6270 from dusenberrymw/Reorder_MLlib_Data_Types_Distributed_matrix_docs and squashes the following commits:
      
      6313bab [Mike Dusenberry] The documentation for BlockMatrix should come after RowMatrix, IndexedRowMatrix, and CoordinateMatrix, as BlockMatrix references the latter three types, and RowMatrix is considered the "basic" distributed matrix.  This will improve comprehensibility of the "Distributed matrix" section, especially for the new reader.
      38605206
    • [SPARK-7586] [ML] [DOC] Add docs of Word2Vec in ml package · 68fb2a46
      Xusen Yin authored
      CC jkbradley.
      
      JIRA [issue](https://issues.apache.org/jira/browse/SPARK-7586).
      
      Author: Xusen Yin <yinxusen@gmail.com>
      
      Closes #6181 from yinxusen/SPARK-7586 and squashes the following commits:
      
      77014c5 [Xusen Yin] comment fix
      57a4c07 [Xusen Yin] small fix for docs
      1178c8f [Xusen Yin] remove the correctness check in java suite
      1c3f389 [Xusen Yin] delete sbt commit
      1af152b [Xusen Yin] check python example code
      1b5369e [Xusen Yin] add docs of word2vec
      68fb2a46
    • [SPARK-7704] Updating Programming Guides per SPARK-4397 · 32fa611b
      Dice authored
      The change in SPARK-4397 lets the compiler find the implicit objects in SparkContext automatically, so we no longer need to import o.a.s.SparkContext._ explicitly and can remove some statements about "implicit conversions" from the latest Programming Guides (1.3.0 and higher).
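
      The language mechanism this relies on can be shown without Spark. The sketch below is an illustration, not Spark code: `Box`, `PairOps`, and `sumByKey` are hypothetical stand-ins for an RDD-like container and its pair operations. Because the implicit conversion lives in the companion object of `Box`, it is part of the implicit scope and the call site needs no import.

      ```scala
        import scala.language.implicitConversions

        // Hypothetical stand-in for an RDD-like container.
        class Box[T](val items: Seq[T])

        // Extra operations that only make sense for key-value elements.
        class PairOps[K, V](self: Box[(K, V)]) {
          def sumByKey(implicit num: Numeric[V]): Map[K, V] =
            self.items.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).sum }
        }

        object Box {
          // Defined in the companion object of Box, so the compiler finds it
          // whenever a Box[(K, V)] needs a method it does not have itself.
          implicit def toPairOps[K, V](box: Box[(K, V)]): PairOps[K, V] = new PairOps(box)
        }

        object ImplicitScopeSketch {
          def main(args: Array[String]): Unit = {
            // No import of Box._ is needed here.
            val totals = new Box(Seq("a" -> 1, "b" -> 2, "a" -> 3)).sumByKey
            println(totals)
          }
        }
      ```

      If `toPairOps` were defined in some unrelated object instead, the call to `sumByKey` would compile only after an explicit import, which is the situation the older guides had to explain.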
      
      Author: Dice <poleon.kd@gmail.com>
      
      Closes #6234 from daisukebe/patch-1 and squashes the following commits:
      
      b77ecd9 [Dice] fix a typo
      45dfcd3 [Dice] rewording per Sean's advice
      a094bcf [Dice] Adding a note for users on any previous releases
      a29be5f [Dice] Updating Programming Guides per SPARK-4397
      32fa611b
    • [SPARK-7723] Fix string interpolation in pipeline examples · df34793a
      Saleem Ansari authored
      https://issues.apache.org/jira/browse/SPARK-7723
      
      Author: Saleem Ansari <tuxdna@gmail.com>
      
      Closes #6258 from tuxdna/master and squashes the following commits:
      
      2bb5a42 [Saleem Ansari] Merge branch 'master' into mllib-pipeline
      e39db9c [Saleem Ansari] Fix string interpolation in pipeline examples
      df34793a
    • Fixing a few basic typos in the Programming Guide. · 61f164d3
      Mike Dusenberry authored
      Just a few minor fixes in the guide, so a new JIRA issue was not created per the guidelines.
      
      Author: Mike Dusenberry <dusenberrymw@gmail.com>
      
      Closes #6240 from dusenberrymw/Fix_Programming_Guide_Typos and squashes the following commits:
      
      ffa76eb [Mike Dusenberry] Fixing a few basic typos in the Programming Guide.
      61f164d3
    • [SPARK-7581] [ML] [DOC] User guide for spark.ml PolynomialExpansion · 6008ec14
      Xusen Yin authored
      JIRA [here](https://issues.apache.org/jira/browse/SPARK-7581).
      
      CC jkbradley
      
      Author: Xusen Yin <yinxusen@gmail.com>
      
      Closes #6113 from yinxusen/SPARK-7581 and squashes the following commits:
      
      1a7d80d [Xusen Yin] merge with master
      892a8e9 [Xusen Yin] fix python 3 compatibility
      ec935bf [Xusen Yin] small fix
      3e9fa1d [Xusen Yin] delete note
      69fcf85 [Xusen Yin] simplify and add python example
      81d21dc [Xusen Yin] add programming guide for Polynomial Expansion
      40babfb [Xusen Yin] add java test suite for PolynomialExpansion
      6008ec14
  5. May 18, 2015
    • [SPARK-7272] [MLLIB] User guide for PMML model export · 814b3dab
      Vincenzo Selvaggio authored
      https://issues.apache.org/jira/browse/SPARK-7272
      
      Author: Vincenzo Selvaggio <vselvaggio@hotmail.it>
      
      Closes #6219 from selvinsource/mllib_pmml_model_export_SPARK-7272 and squashes the following commits:
      
      c866fb8 [Vincenzo Selvaggio] Update mllib-pmml-model-export.md
      1beda98 [Vincenzo Selvaggio] [SPARK-7272] Initial user guide for pmml export
      d670662 [Vincenzo Selvaggio] Update mllib-pmml-model-export.md
      2731375 [Vincenzo Selvaggio] Update mllib-pmml-model-export.md
      680dc33 [Vincenzo Selvaggio] Update mllib-pmml-model-export.md
      2e298b5 [Vincenzo Selvaggio] Update mllib-pmml-model-export.md
      a932f51 [Vincenzo Selvaggio] Create mllib-pmml-model-export.md
      814b3dab
  6. May 16, 2015
  7. May 15, 2015
  8. May 14, 2015
    • [SPARK-7249] Updated Hadoop dependencies due to inconsistency in the versions · 7fb715de
      FavioVazquez authored
      Updated Hadoop dependencies due to inconsistency in the versions. Now the global properties are the ones used by the hadoop-2.2 profile, and the profile was set to empty but kept for backwards compatibility reasons.
      
      Changes proposed by vanzin, resulting from the previous pull request https://github.com/apache/spark/pull/5783, which did not fix the problem correctly.
      
      Please let me know if this is the correct way of doing this; vanzin's comments are in the pull request mentioned above.
      
      Author: FavioVazquez <favio.vazquezp@gmail.com>
      
      Closes #5786 from FavioVazquez/update-hadoop-dependencies and squashes the following commits:
      
      11670e5 [FavioVazquez] - Added missing instance of -Phadoop-2.2 in create-release.sh
      379f50d [FavioVazquez] - Added instances of -Phadoop-2.2 in create-release.sh, run-tests, scalastyle and building-spark.md - Reconstructed docs to not ask users to rely on default behavior
      3f9249d [FavioVazquez] Merge branch 'master' of https://github.com/apache/spark into update-hadoop-dependencies
      31bdafa [FavioVazquez] - Added missing instances in -Phadoop-1 in create-release.sh, run-tests and in the building-spark documentation
      cbb93e8 [FavioVazquez] - Added comment related to SPARK-3710 about  hadoop-yarn-server-tests in Hadoop 2.2 that fails to pull some needed dependencies
      83dc332 [FavioVazquez] - Cleaned up the main POM concerning the yarn profile - Erased hadoop-2.2 profile from yarn/pom.xml and its content was integrated into yarn/pom.xml
      93f7624 [FavioVazquez] - Deleted unnecessary comments and <activation> tag on the YARN profile in the main POM
      668d126 [FavioVazquez] - Moved <dependencies> <activation> and <properties> sections of the hadoop-2.2 profile in the YARN POM to the YARN profile in the root POM - Erased unnecessary hadoop-2.2 profile from the YARN POM
      fda6a51 [FavioVazquez] - Updated hadoop1 releases in create-release.sh  due to changes in the default hadoop version set - Erased unnecessary instance of -Dyarn.version=2.2.0 in create-release.sh - Prettify comment in yarn/pom.xml
      0470587 [FavioVazquez] - Erased unnecessary instance of -Phadoop-2.2 -Dhadoop.version=2.2.0 in create-release.sh - Updated how the releases are made in the create-release.sh now that the default hadoop version is 2.2.0 - Erased unnecessary instance of -Phadoop-2.2 -Dhadoop.version=2.2.0 in scalastyle - Erased unnecessary instance of -Phadoop-2.2 -Dhadoop.version=2.2.0 in run-tests - Better example given in the hadoop-third-party-distributions.md now that the default hadoop version is 2.2.0
      a650779 [FavioVazquez] - Default value of avro.mapred.classifier has been set to hadoop2 in pom.xml - Cleaned up hadoop-2.3 and 2.4 profiles due to change in the default set in avro.mapred.classifier in pom.xml
      199f40b [FavioVazquez] - Erased unnecessary CDH5-specific note in docs/building-spark.md - Remove example of instance -Phadoop-2.2 -Dhadoop.version=2.2.0 in docs/building-spark.md - Enabled hadoop-2.2 profile when the Hadoop version is 2.2.0, which is now the default. Added comment in the yarn/pom.xml to specify that.
      88a8b88 [FavioVazquez] - Simplified Hadoop profiles due to new setting of global properties in the pom.xml file - Added comment to specify that the hadoop-2.2 profile is now the default hadoop profile in the pom.xml file - Erased hadoop-2.2 from related hadoop profiles now that is a no-op in the make-distribution.sh file
      70b8344 [FavioVazquez] - Fixed typo in the make-distribution.sh file and added hadoop-1 in the Related profiles
      287fa2f [FavioVazquez] - Updated documentation about specifying the hadoop version in building-spark. Now it is clear that Spark will build against Hadoop 2.2.0 by default. - Added Cloudera CDH 5.3.3 without MapReduce example in the building-spark doc.
      1354292 [FavioVazquez] - Fixed hadoop-1 version to match jenkins build profile in hadoop1.0 tests and documentation
      6b4bfaf [FavioVazquez] - Cleanup in hadoop-2.x profiles since they contained mostly redundant stuff.
      7e9955d [FavioVazquez] - Updated Hadoop dependencies due to inconsistency in the versions. Now the global properties are the ones used by the hadoop-2.2 profile, and the profile was set to empty but kept for backwards compatibility reasons
      660decc [FavioVazquez] - Updated Hadoop dependencies due to inconsistency in the versions. Now the global properties are the ones used by the hadoop-2.2 profile, and the profile was set to empty but kept for backwards compatibility reasons
      ec91ce3 [FavioVazquez] - Updated protobuf-java version of com.google.protobuf dependency to fix blocking error when connecting to HDFS via the Hadoop Cloudera HDFS CDH5 (fix for 2.5.0-cdh5.3.3 version)
      7fb715de
  9. May 12, 2015
    • [SPARK-7557] [ML] [DOC] User guide for spark.ml HashingTF, Tokenizer · f0c1bc34
      Joseph K. Bradley authored
      Added feature transformer subsection to spark.ml guide, with HashingTF and Tokenizer.  Added JavaHashingTFSuite to test Java examples in new guide.
      
      I've run Scala, Python examples in the Spark/PySpark shells.  I ran the Java examples via the test suite (with small modifications for printing).
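
      As a rough picture of what the two transformers compute, here is a plain-Scala sketch that does not use the spark.ml API and is not taken from the guide. It assumes tokenization means lowercasing and splitting on whitespace, and that term hashing maps each term's hash code into a fixed number of buckets; the object and method names are hypothetical.

      ```scala
        object HashingTFSketch {
          // Tokenizer idea: lowercase the text and split it on whitespace.
          def tokenize(text: String): Seq[String] =
            text.toLowerCase.split("\\s+").filter(_.nonEmpty).toSeq

          // Map a term to a bucket in [0, numFeatures) using its hash code.
          def indexOf(term: String, numFeatures: Int): Int = {
            val raw = term.hashCode % numFeatures
            if (raw < 0) raw + numFeatures else raw
          }

          // HashingTF idea: sparse term-frequency vector as bucket index -> count.
          // Different terms may collide in one bucket; their counts then add up.
          def termFrequencies(tokens: Seq[String], numFeatures: Int): Map[Int, Double] =
            tokens.groupBy(indexOf(_, numFeatures)).map { case (i, ts) => i -> ts.size.toDouble }
        }
      ```

      No vocabulary has to be built or stored, which is the appeal of the hashing trick; the cost is that collisions become more likely as the number of buckets shrinks.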
      
      CC: mengxr
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #6093 from jkbradley/hashingtf-guide and squashes the following commits:
      
      d5d213f [Joseph K. Bradley] small fix
      dd6e91a [Joseph K. Bradley] fixes from code review of user guide
      33c3ff9 [Joseph K. Bradley] small fix
      bc6058c [Joseph K. Bradley] fix link
      361a174 [Joseph K. Bradley] Added subsection for feature transformers to spark.ml guide, with HashingTF and Tokenizer.  Added JavaHashingTFSuite to test Java examples in new guide
      f0c1bc34
    • [SPARK-7496] [MLLIB] Update Programming guide with Online LDA · 1d703660
      Yuhao Yang authored
      jira: https://issues.apache.org/jira/browse/SPARK-7496
      
      Update LDA subsection of clustering section of MLlib programming guide to include OnlineLDA.
      
      Author: Yuhao Yang <hhbyyh@gmail.com>
      
      Closes #6046 from hhbyyh/ldaDocument and squashes the following commits:
      
      4b6fbfa [Yuhao Yang] add online paper and some comparison
      fd4c983 [Yuhao Yang] update lda document for optimizers
      1d703660
    • [SPARK-6994][SQL] Update docs for fetching Row fields by name · 640f63b9
      vidmantas zemleris authored
      add docs for https://issues.apache.org/jira/browse/SPARK-6994
      
      Author: vidmantas zemleris <vidmantas@vinted.com>
      
      Closes #6030 from vidma/docs/row-with-named-fields and squashes the following commits:
      
      241b401 [vidmantas zemleris] [SPARK-6994][SQL] Update docs for fetching Row fields by name
      640f63b9
  10. May 11, 2015
  11. May 10, 2015
    • [SPARK-5521] PCA wrapper for easy transform vectors · 8c07c75c
      Kirill A. Korinskiy authored
      I implemented a simple PCA wrapper that makes it easy to transform vectors with PCA, for example the features of a LabeledPoint or of another more complicated structure.
      
      Example of usage:
      ```
        import org.apache.spark.mllib.regression.LinearRegressionWithSGD
        import org.apache.spark.mllib.regression.LabeledPoint
        import org.apache.spark.mllib.linalg.Vectors
        import org.apache.spark.mllib.feature.PCA
      
        val data = sc.textFile("data/mllib/ridge-data/lpsa.data").map { line =>
          val parts = line.split(',')
          LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
        }.cache()
      
        val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
        val training = splits(0).cache()
        val test = splits(1)
      
        // Fit PCA with k = half the original feature count (merged API: new PCA(k).fit(...)).
        val pca = new PCA(training.first().features.size / 2).fit(data.map(_.features))
        val training_pca = training.map(p => p.copy(features = pca.transform(p.features)))
        val test_pca = test.map(p => p.copy(features = pca.transform(p.features)))
      
        val numIterations = 100
        val model = LinearRegressionWithSGD.train(training, numIterations)
        val model_pca = LinearRegressionWithSGD.train(training_pca, numIterations)
      
        val valuesAndPreds = test.map { point =>
          val score = model.predict(point.features)
          (score, point.label)
        }
      
        val valuesAndPreds_pca = test_pca.map { point =>
          val score = model_pca.predict(point.features)
          (score, point.label)
        }
      
        val MSE = valuesAndPreds.map{case(v, p) => math.pow((v - p), 2)}.mean()
        val MSE_pca = valuesAndPreds_pca.map{case(v, p) => math.pow((v - p), 2)}.mean()
      
        println("Mean Squared Error = " + MSE)
        println("PCA Mean Squared Error = " + MSE_pca)
      ```
      
      Author: Kirill A. Korinskiy <catap@catap.ru>
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #4304 from catap/pca and squashes the following commits:
      
      501bcd9 [Joseph K. Bradley] Small updates: removed k from Java-friendly PCA fit().  In PCASuite, converted results to set for comparison. Added an error message for bad k in PCA.
      9dcc02b [Kirill A. Korinskiy] [SPARK-5521] fix scala style
      1892a06 [Kirill A. Korinskiy] [SPARK-5521] PCA wrapper for easy transform vectors
      8c07c75c
  12. May 09, 2015
    • [STREAMING] [DOCS] Fix wrong url about API docs of StreamingListener · 7d0f1720
      dobashim authored
      A small fix for the wrong URL of the API documentation (org.apache.spark.streaming.scheduler.StreamingListener).
      
      Author: dobashim <dobashim@oss.nttdata.co.jp>
      
      Closes #6024 from dobashim/master and squashes the following commits:
      
      ac9a955 [dobashim] [STREAMING][DOCS] Fix wrong url about API docs of StreamingListener
      7d0f1720
  13. May 08, 2015
    • [SPARK-3454] separate json endpoints for data in the UI · c796be70
      Imran Rashid authored
      Exposes data available in the UI as json over http.  Key points:
      
      * new endpoints, handled independently of existing XyzPage classes.  Root entrypoint is `JsonRootResource`
      * Uses jersey + jackson for routing & converting POJOs into json
      * tests against known results in `HistoryServerSuite`
      * also fixes some minor issues w/ the UI -- synchronizing on access to `StorageListener` & `StorageStatusListener`, and fixing some inconsistencies w/ the way we handle retained jobs & stages.
      
      Author: Imran Rashid <irashid@cloudera.com>
      
      Closes #5940 from squito/SPARK-3454_better_test_files and squashes the following commits:
      
      1a72ed6 [Imran Rashid] rats
      85fdb3e [Imran Rashid] Merge branch 'no_php' into SPARK-3454
      1fc65b0 [Imran Rashid] Revert "Revert "[SPARK-3454] separate json endpoints for data in the UI""
      1276900 [Imran Rashid] get rid of giant event file, replace w/ smaller one; check both shuffle read & shuffle write
      4e12013 [Imran Rashid] just use test case name for expectation file name
      863ef64 [Imran Rashid] rename json files to avoid strange file names and not look like php
      c796be70
  14. May 07, 2015
    • [SPARK-5726] [MLLIB] Elementwise (Hadamard) Vector Product Transformer · 658a478d
      Octavian Geagla authored
      See https://issues.apache.org/jira/browse/SPARK-5726
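
      For readers unfamiliar with the term, the elementwise (Hadamard) product multiplies two vectors of equal length position by position. The following is a minimal plain-Scala sketch of that operation, not the transformer added in this commit; the object and method names are made up for illustration.

      ```scala
        object ElementwiseProductSketch {
          // Multiply each component of v by the component of w at the same position.
          def hadamard(v: Array[Double], w: Array[Double]): Array[Double] = {
            require(v.length == w.length, "vectors must have the same length")
            v.zip(w).map { case (a, b) => a * b }
          }
        }
      ```

      Used with a fixed weight vector `w`, this scales each feature of an input vector by its own weight, and a zero weight drops the feature entirely.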
      
      Author: Octavian Geagla <ogeagla@gmail.com>
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #4580 from ogeagla/spark-mllib-weighting and squashes the following commits:
      
      fac12ad [Octavian Geagla] [SPARK-5726] [MLLIB] Use new createTransformFunc.
      90f7e39 [Joseph K. Bradley] small cleanups
      4595165 [Octavian Geagla] [SPARK-5726] [MLLIB] Remove erroneous test case.
      ded3ac6 [Octavian Geagla] [SPARK-5726] [MLLIB] Pass style checks.
      37d4705 [Octavian Geagla] [SPARK-5726] [MLLIB] Incorporated feedback.
      1dffeee [Octavian Geagla] [SPARK-5726] [MLLIB] Pass style checks.
      e436896 [Octavian Geagla] [SPARK-5726] [MLLIB] Remove 'TF' from 'ElementwiseProductTF'
      cb520e6 [Octavian Geagla] [SPARK-5726] [MLLIB] Rename HadamardProduct to ElementwiseProduct
      4922722 [Octavian Geagla] [SPARK-5726] [MLLIB] Hadamard Vector Product Transformer
      658a478d
    • [SPARK-7035] Encourage __getitem__ over __getattr__ on column access in the Python DataFrame API · fae4e2d6
      ksonj authored
      Author: ksonj <kson@siberie.de>
      
      Closes #5971 from ksonj/doc and squashes the following commits:
      
      dadfebb [ksonj] __getitem__ is cleaner than __getattr__
      fae4e2d6
  15. May 05, 2015
    • Revert "[SPARK-3454] separate json endpoints for data in the UI" · 51b3d41e
      Reynold Xin authored
      This reverts commit d4973580.
      
      The commit broke Spark on Windows.
      51b3d41e
    • [SPARK-7351] [STREAMING] [DOCS] Add spark.streaming.ui.retainedBatches to docs · fec7b29f
      zsxwing authored
      The default value will be changed to `1000` in #5533. So here I just used `1000`.
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #5899 from zsxwing/SPARK-7351 and squashes the following commits:
      
      e1ec515 [zsxwing] [SPARK-7351][Streaming][Docs] Add spark.streaming.ui.retainedBatches to docs
      fec7b29f
    • [SPARK-3454] separate json endpoints for data in the UI · d4973580
      Imran Rashid authored
      Exposes data available in the UI as json over http.  Key points:
      
      * new endpoints, handled independently of existing XyzPage classes.  Root entrypoint is `JsonRootResource`
      * Uses jersey + jackson for routing & converting POJOs into json
      * tests against known results in `HistoryServerSuite`
      * also fixes some minor issues w/ the UI -- synchronizing on access to `StorageListener` & `StorageStatusListener`, and fixing some inconsistencies w/ the way we handle retained jobs & stages.
      
      Author: Imran Rashid <irashid@cloudera.com>
      
      Closes #4435 from squito/SPARK-3454 and squashes the following commits:
      
      da1e35f [Imran Rashid] typos etc.
      5e78b4f [Imran Rashid] fix rendering problems
      5ae02ad [Imran Rashid] Merge branch 'master' into SPARK-3454
      f016182 [Imran Rashid] change all json-pojo class constructors to be private[spark] to protect us from mima-false-positives if we add fields
      3347b72 [Imran Rashid] mark EnumUtil as @Private
      ec140a2 [Imran Rashid] create @Private
      cc1febf [Imran Rashid] add docs on the metrics-as-json api
      cbaf287 [Imran Rashid] Merge branch 'master' into SPARK-3454
      56db31e [Imran Rashid] update tests for multi-attempt
      7f3bc4e [Imran Rashid] Revert "add sbt-revolved plugin, to make it easier to start & stop http servers in sbt"
      67008b4 [Imran Rashid] rats
      9e51400 [Imran Rashid] style
      c9bae1c [Imran Rashid] handle multiple attempts per app
      b87cd63 [Imran Rashid] add sbt-revolved plugin, to make it easier to start & stop http servers in sbt
      188762c [Imran Rashid] multi-attempt
      2af11e5 [Imran Rashid] Merge branch 'master' into SPARK-3454
      befff0c [Imran Rashid] review feedback
      14ac3ed [Imran Rashid] jersey-core needs to be explicit; move version & scope to parent pom.xml
      f90680e [Imran Rashid] Merge branch 'master' into SPARK-3454
      dc8a7fe [Imran Rashid] style, fix errant comments
      acb7ef6 [Imran Rashid] fix indentation
      7bf1811 [Imran Rashid] move MetricHelper so mima doesnt think its exposed; comments
      9d889d6 [Imran Rashid] undo some unnecessary changes
      f48a7b0 [Imran Rashid] docs
      52bbae8 [Imran Rashid] StorageListener & StorageStatusListener needs to synchronize internally to be thread-safe
      31c79ce [Imran Rashid] asm no longer needed for SPARK_PREPEND_CLASSES
      b2f8b91 [Imran Rashid] @DeveloperApi
      2e19be2 [Imran Rashid] lazily convert ApplicationInfo to avoid memory overhead
      ba3d9d2 [Imran Rashid] upper case enums
      39ac29c [Imran Rashid] move EnumUtil
      d2bde77 [Imran Rashid] update error handling & scoping
      4a234d3 [Imran Rashid] avoid jersey-media-json-jackson b/c of potential version conflicts
      a157a2f [Imran Rashid] style
      7bd4d15 [Imran Rashid] delete security test, since it doesnt do anything
      a325563 [Imran Rashid] style
      a9c5cf1 [Imran Rashid] undo changes superseded by master
      0c6f968 [Imran Rashid] update deps
      1ed0d07 [Imran Rashid] Merge branch 'master' into SPARK-3454
      4c92af6 [Imran Rashid] style
      f2e63ad [Imran Rashid] Merge branch 'master' into SPARK-3454
      c22b11f [Imran Rashid] fix compile error
      9ea682c [Imran Rashid] go back to good ol' java enums
      cf86175 [Imran Rashid] style
      d493b38 [Imran Rashid] Merge branch 'master' into SPARK-3454
      f05ae89 [Imran Rashid] add in ExecutorSummaryInfo for MiMa :(
      101a698 [Imran Rashid] style
      d2ef58d [Imran Rashid] revert changes that had HistoryServer refresh the application listing more often
      b136e39b [Imran Rashid] Revert "add sbt-revolved plugin, to make it easier to start & stop http servers in sbt"
      e031719 [Imran Rashid] fixes from review
      1f53a66 [Imran Rashid] style
      b4a7863 [Imran Rashid] fix compile error
      2c8b7ee [Imran Rashid] rats
      1578a4a [Imran Rashid] doc
      674f8dc [Imran Rashid] more explicit about total numbers of jobs & stages vs. number retained
      9922be0 [Imran Rashid] Merge branch 'master' into stage_distributions
      f5a5196 [Imran Rashid] undo removal of renderJson from MasterPage, since there is no substitute yet
      db61211 [Imran Rashid] get JobProgressListener directly from UI
      fdfc181 [Imran Rashid] stage/taskList
      63eb4a6 [Imran Rashid] tests for taskSummary
      ad27de8 [Imran Rashid] error handling on quantile values
      b2efcaf [Imran Rashid] cleanup, combine stage-related paths into one resource
      aaba896 [Imran Rashid] wire up task summary
      a4b1397 [Imran Rashid] stage metric distributions
      e48ba32 [Imran Rashid] rename
      eaf3bbb [Imran Rashid] style
      25cd894 [Imran Rashid] if only given day, assume GMT
      51eaedb [Imran Rashid] more visibility fixes
      9f28b7e [Imran Rashid] ack, more cleanup
      99764e1 [Imran Rashid] Merge branch 'SPARK-3454_w_jersey' into SPARK-3454
      a61a43c [Imran Rashid] oops, remove accidental checkin
      a066055 [Imran Rashid] set visibility on a lot of classes
      1f361c8 [Imran Rashid] update rat-excludes
      0be5120 [Imran Rashid] Merge branch 'master' into SPARK-3454_w_jersey
      2382bef [Imran Rashid] switch to using new "enum"
      fef6605 [Imran Rashid] some utils for working w/ new "enum" format
      dbfc7bf [Imran Rashid] style
      b86bcb0 [Imran Rashid] update test to look at one stage attempt
      5f9df24 [Imran Rashid] style
      7fd156a [Imran Rashid] refactor jsonDiff to avoid code duplication
      73f1378 [Imran Rashid] test json; also add test cases for cleaned stages & jobs
      97d411f [Imran Rashid] json endpoint for one job
      0c96147 [Imran Rashid] better error msgs for bad stageId vs bad attemptId
      dddbd29 [Imran Rashid] stages have attempt; jobs are sorted; resource for all attempts for one stage
      190c17a [Imran Rashid] StagePage should distinguish no task data, from unknown stage
      84cd497 [Imran Rashid] AllJobsPage should still report correct completed & failed job count, even if some have been cleaned, to make it consistent w/ AllStagesPage
      36e4062 [Imran Rashid] SparkUI needs to know about startTime, so it can list its own applicationInfo
      b4c75ed [Imran Rashid] fix merge conflicts; need to widen visibility in a few cases
      e91750a [Imran Rashid] Merge branch 'master' into SPARK-3454_w_jersey
      56d2fc7 [Imran Rashid] jersey needs asm for SPARK_PREPEND_CLASSES to work
      f7df095 [Imran Rashid] add test for accumulables, and discover that I need update after all
      9c0c125 [Imran Rashid] add accumulableInfo
      00e9cc5 [Imran Rashid] more style
      3377e61 [Imran Rashid] scaladoc
      d05f7a9 [Imran Rashid] dont use case classes for status api POJOs, since they have binary compatibility issues
      654cecf [Imran Rashid] move all the status api POJOs to one file
      b86e2b0 [Imran Rashid] style
      18a8c45 [Imran Rashid] Merge branch 'master' into SPARK-3454_w_jersey
      5598f19 [Imran Rashid] delete some unnecessary code, more to go
      56edce0 [Imran Rashid] style
      017c755 [Imran Rashid] add in metrics now available
      1b78cb7 [Imran Rashid] fix some import ordering
      0dc3ea7 [Imran Rashid] if app isnt found, reload apps from FS before giving up
      c7d884f [Imran Rashid] fix merge conflicts
      0c12b50 [Imran Rashid] Merge branch 'master' into SPARK-3454_w_jersey
      b6a96a8 [Imran Rashid] compare json by AST, not string
      cd37845 [Imran Rashid] switch to using java.util.Dates for times
      a4ab5aa [Imran Rashid] add in explicit dependency on jersey 1.9 -- maven wasn't happy before this
      4fdc39f [Imran Rashid] refactor case insensitive enum parsing
      cba1ef6 [Imran Rashid] add security (maybe?) for metrics json
      f0264a7 [Imran Rashid] switch to using jersey for metrics json
      bceb3a9 [Imran Rashid] set http response code on error, some testing
      e0356b6 [Imran Rashid] put new test expectation files in rat excludes (is this OK?)
      b252e7a [Imran Rashid] small cleanup of accidental changes
      d1a8c92 [Imran Rashid] add sbt-revolved plugin, to make it easier to start & stop http servers in sbt
      4b398d0 [Imran Rashid] expose UI data as json in new endpoints
      d4973580
    • Sandy Ryza's avatar
      [SPARK-5112] Expose SizeEstimator as a developer api · 4222da68
      Sandy Ryza authored
      "The best way to size the amount of memory consumption your dataset will require is to create an RDD, put it into cache, and look at the SparkContext logs on your driver program. The logs will tell you how much memory each partition is consuming, which you can aggregate to get the total size of the RDD."
      -the Tuning Spark page
      
      This is a pain. It would be much nicer to expose simple functionality for understanding the memory footprint of a Java object.
      
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #3913 from sryza/sandy-spark-5112 and squashes the following commits:
      
      8d9e082 [Sandy Ryza] Add SizeEstimator in org.apache.spark
      2e1a906 [Sandy Ryza] Revert "Move SizeEstimator out of util"
      93f4cd0 [Sandy Ryza] Move SizeEstimator out of util
      e21c1f4 [Sandy Ryza] Remove unused import
      798ab88 [Sandy Ryza] Update documentation and add to SparkContext
      34c523c [Sandy Ryza] SPARK-5112. Expose SizeEstimator as a developer api
      4222da68
    • shekhar.bansal's avatar
      [SPARK-6653] [YARN] New config to specify port for sparkYarnAM actor system · fc8feaa8
      shekhar.bansal authored
      Author: shekhar.bansal <shekhar.bansal@guavus.com>
      
      Closes #5719 from zuxqoj/master and squashes the following commits:
      
      5574ff7 [shekhar.bansal] [SPARK-6653][yarn] New config to specify port for sparkYarnAM actor system
      5117258 [shekhar.bansal] [SPARK-6653][yarn] New config to specify port for sparkYarnAM actor system
      9de5330 [shekhar.bansal] [SPARK-6653][yarn] New config to specify port for sparkYarnAM actor system
      456a592 [shekhar.bansal] [SPARK-6653][yarn] New configuration property to specify port for sparkYarnAM actor system
      803e93e [shekhar.bansal] [SPARK-6653][yarn] New configuration property to specify port for sparkYarnAM actor system
      fc8feaa8
  16. May 03, 2015
    • Sean Owen's avatar
      [SPARK-7302] [DOCS] SPARK building documentation still mentions building for yarn 0.23 · 9e25b09f
      Sean Owen authored
      Remove references to Hadoop 0.23
      
      CC tgravescs Is this what you had in mind? basically all refs to 0.23?
      We don't support YARN 0.23, but also don't support Hadoop 0.23 anymore AFAICT. There are no builds or releases for it.
      
      In fact, on a related note, refs to CDH3 (Hadoop 0.20.2) should be removed as this certainly isn't supported either.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #5863 from srowen/SPARK-7302 and squashes the following commits:
      
      42f5d1e [Sean Owen] Remove CDH3 (Hadoop 0.20.2) refs too
      dad02e3 [Sean Owen] Remove references to Hadoop 0.23
      9e25b09f
  17. May 02, 2015
    • BenFradet's avatar
      [SPARK-7255] [STREAMING] [DOCUMENTATION] Added documentation for spark.streaming.kafka.maxRetries · ea841efc
      BenFradet authored
      Added documentation for spark.streaming.kafka.maxRetries
      
      Author: BenFradet <benjamin.fradet@gmail.com>
      
      Closes #5808 from BenFradet/master and squashes the following commits:
      
      cc72e7a [BenFradet] updated doc for spark.streaming.kafka.maxRetries to explain the default value
      18f823e [BenFradet] Added "consecutive" to the spark.streaming.kafka.maxRetries doc
      597fdeb [BenFradet] Mention that spark.streaming.kafka.maxRetries only applies to the direct kafka api
      0efad39 [BenFradet] Added documentation for spark.streaming.kafka.maxRetries
      ea841efc
  18. May 01, 2015
    • Chris Heller's avatar
      [SPARK-2691] [MESOS] Support for Mesos DockerInfo · 8f50a07d
      Chris Heller authored
      This patch adds partial support for running Spark on Mesos inside a Docker container. Only fine-grained mode is presently supported, and nothing checks that the version of libmesos is recent enough to have a DockerInfo structure in the protobuf (other than pinning a Mesos version in the pom.xml).
      
      Author: Chris Heller <hellertime@gmail.com>
      
      Closes #3074 from hellertime/SPARK-2691 and squashes the following commits:
      
      d504af6 [Chris Heller] Assist type inference
      f64885d [Chris Heller] Fix errant line length
      17c41c0 [Chris Heller] Base Dockerfile on mesosphere/mesos image
      8aebda4 [Chris Heller] Simplify Docker image docs
      1ae7f4f [Chris Heller] Style points
      974bd56 [Chris Heller] Convert map to flatMap
      5d8bdf7 [Chris Heller] Factor out the DockerInfo construction.
      7b75a3d [Chris Heller] Align to styleguide
      80108e7 [Chris Heller] Bend to the will of RAT
      ba77056 [Chris Heller] Explicit RAT exclude
      abda5e5 [Chris Heller] Wildcard .rat-excludes
      2f2873c [Chris Heller] Exclude spark-mesos from RAT
      a589a5b [Chris Heller] Add example Dockerfile
      b6825ce [Chris Heller] Remove use of EasyMock
      eae1b86 [Chris Heller] Move properties under 'spark.mesos.'
      c184d00 [Chris Heller] Use map on Option to be consistent with non-coarse code
      fb9501a [Chris Heller] Bumped mesos version to current release
      fa11879 [Chris Heller] Add listenerBus to EasyMock
      882151e [Chris Heller] Changes to scala style
      b22d42d [Chris Heller] Exclude template from RAT
      db536cf [Chris Heller] Remove unneeded mocks
      dea1bd5 [Chris Heller] Force default protocol
      7dac042 [Chris Heller] Add test for DockerInfo
      5456c0c [Chris Heller] Adjust syntax style
      521c194 [Chris Heller] Adjust version info
      6e38f70 [Chris Heller] Document Mesos Docker properties
      29572ab [Chris Heller] Support all DockerInfo fields
      b8c0dea [Chris Heller] Support for mesos DockerInfo in coarse-mode.
      482a9fd [Chris Heller] Support for mesos DockerInfo in fine-grained mode.
      8f50a07d
    • Hari Shreedharan's avatar
      [SPARK-5342] [YARN] Allow long running Spark apps to run on secure YARN/HDFS · b1f4ca82
      Hari Shreedharan authored
      Take 2. Does the same thing as #4688, but fixes Hadoop-1 build.
      
      Author: Hari Shreedharan <hshreedharan@apache.org>
      
      Closes #5823 from harishreedharan/kerberos-longrunning and squashes the following commits:
      
      3c86bba [Hari Shreedharan] Import fixes. Import postfixOps explicitly.
      4d04301 [Hari Shreedharan] Minor formatting fixes.
      b5e7a72 [Hari Shreedharan] Remove reflection, use a method in SparkHadoopUtil to update the token renewer.
      7bff6e9 [Hari Shreedharan] Make sure all required classes are present in the jar. Fix import order.
      e851f70 [Hari Shreedharan] Move the ExecutorDelegationTokenRenewer to yarn module. Use reflection to use it.
      36eb8a9 [Hari Shreedharan] Change the renewal interval config param. Fix a bunch of comments.
      611923a [Hari Shreedharan] Make sure the namenodes are listed correctly for creating tokens.
      09fe224 [Hari Shreedharan] Use token.renew to get token's renewal interval rather than using hdfs-site.xml
      6963bbc [Hari Shreedharan] Schedule renewal in AM before starting user class. Else, a restarted AM cannot access HDFS if the user class tries to.
      072659e [Hari Shreedharan] Fix build failure caused by thread factory getting moved to ThreadUtils.
      f041dd3 [Hari Shreedharan] Merge branch 'master' into kerberos-longrunning
      42eead4 [Hari Shreedharan] Remove RPC part. Refactor and move methods around, use renewal interval rather than max lifetime to create new tokens.
      ebb36f5 [Hari Shreedharan] Merge branch 'master' into kerberos-longrunning
      bc083e3 [Hari Shreedharan] Overload RegisteredExecutor to send tokens. Minor doc updates.
      7b19643 [Hari Shreedharan] Merge branch 'master' into kerberos-longrunning
      8a4f268 [Hari Shreedharan] Added docs in the security guide. Changed some code to ensure that the renewer objects are created only if required.
      e800c8b [Hari Shreedharan] Restore original RegisteredExecutor message, and send new tokens via NewTokens message.
      0e9507e [Hari Shreedharan] Merge branch 'master' into kerberos-longrunning
      7f1bc58 [Hari Shreedharan] Minor fixes, cleanup.
      bcd11f9 [Hari Shreedharan] Refactor AM and Executor token update code into separate classes, also send tokens via akka on executor startup.
      f74303c [Hari Shreedharan] Move the new logic into specialized classes. Add cleanup for old credentials files.
      2f9975c [Hari Shreedharan] Ensure new tokens are written out immediately on AM restart. Also, pick up the latest suffix from HDFS if the AM is restarted.
      61b2b27 [Hari Shreedharan] Account for AM restarts by making sure lastSuffix is read from the files on HDFS.
      62c45ce [Hari Shreedharan] Relogin from keytab periodically.
      fa233bd [Hari Shreedharan] Adding logging, fixing minor formatting and ordering issues.
      42813b4 [Hari Shreedharan] Remove utils.sh, which was re-added due to merge with master.
      0de27ee [Hari Shreedharan] Merge branch 'master' into kerberos-longrunning
      55522e3 [Hari Shreedharan] Fix failure caused by Preconditions ambiguity.
      9ef5f1b [Hari Shreedharan] Added explanation of how the credentials refresh works, some other minor fixes.
      f4fd711 [Hari Shreedharan] Fix SparkConf usage.
      2debcea [Hari Shreedharan] Change the file structure for credentials files. I will push a followup patch which adds a cleanup mechanism for old credentials files. The credentials files are small and few enough for it not to cause issues on HDFS.
      af6d5f0 [Hari Shreedharan] Cleaning up files where changes weren't required.
      f0f54cb [Hari Shreedharan] Be more defensive when updating the credentials file.
      f6954da [Hari Shreedharan] Got rid of Akka communication to renew, instead the executors check a known file's modification time to read the credentials.
      5c11c3e [Hari Shreedharan] Move tests to YarnSparkHadoopUtil to fix compile issues.
      b4cb917 [Hari Shreedharan] Send keytab to AM via DistributedCache rather than directly via HDFS
      0985b4e [Hari Shreedharan] Write tokens to HDFS and read them back when required, rather than sending them over the wire.
      d79b2b9 [Hari Shreedharan] Make sure correct credentials are passed to FileSystem#addDelegationTokens()
      8c6928a [Hari Shreedharan] Fix issue caused by direct creation of Actor object.
      fb27f46 [Hari Shreedharan] Make sure principal and keytab are set before CoarseGrainedSchedulerBackend is started. Also schedule re-logins in CoarseGrainedSchedulerBackend#start()
      41efde0 [Hari Shreedharan] Merge branch 'master' into kerberos-longrunning
      d282d7a [Hari Shreedharan] Fix ClientSuite to set YARN mode, so that the correct class is used in tests.
      bcfc374 [Hari Shreedharan] Fix Hadoop-1 build by adding no-op methods in SparkHadoopUtil, with impl in YarnSparkHadoopUtil.
      f8fe694 [Hari Shreedharan] Handle None if keytab-login is not scheduled.
      2b0d745 [Hari Shreedharan] [SPARK-5342][YARN] Allow long running Spark apps to run on secure YARN/HDFS.
      ccba5bc [Hari Shreedharan] WIP: More changes wrt kerberos
      77914dd [Hari Shreedharan] WIP: Add kerberos principal and keytab to YARN client.
      b1f4ca82
    • Marcelo Vanzin's avatar
      [SPARK-7281] [YARN] Add option to set AM's lib path in client mode. · 7b5dd3e3
      Marcelo Vanzin authored
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #5813 from vanzin/SPARK-7281 and squashes the following commits:
      
      1cb6f42 [Marcelo Vanzin] [SPARK-7281] [yarn] Add option to set AM's lib path in client mode.
      7b5dd3e3