  1. May 11, 2015
    • jerryshao's avatar
      [STREAMING] [MINOR] Close files correctly when iterator is finished in streaming WAL recovery · 25c01c54
      jerryshao authored
      Currently there's no chance to close the file correctly after the iteration is finished, change to `CompletionIterator` to avoid resource leakage.
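A minimal standalone sketch of the `CompletionIterator` pattern this commit applies (Spark's own class is `private[spark]`; the names below are illustrative, not Spark's code): run a callback once the wrapped iterator is exhausted, so the underlying file is closed.
```
import java.io.{BufferedReader, FileReader}

class CompletionIterator[A](sub: Iterator[A], onComplete: () => Unit) extends Iterator[A] {
  private var completed = false
  override def hasNext: Boolean = {
    val more = sub.hasNext
    if (!more && !completed) { completed = true; onComplete() } // fires exactly once
    more
  }
  override def next(): A = sub.next()
}

object WalReadExample {
  // Iterate a file's lines and close the reader when iteration finishes.
  def linesOf(path: String): Iterator[String] = {
    val reader = new BufferedReader(new FileReader(path))
    new CompletionIterator[String](
      Iterator.continually(reader.readLine()).takeWhile(_ != null),
      () => reader.close())
  }
}
```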
      
      Author: jerryshao <saisai.shao@intel.com>
      
      Closes #6050 from jerryshao/close-file-correctly and squashes the following commits:
      
      52dfaf5 [jerryshao] Close files correctly when iterator is finished
      25c01c54
    • gchen's avatar
[SPARK-7516] [Minor] [DOC] Replace deprecated inferSchema() with createDataFrame() · 8e674331
      gchen authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-7516
      
In sql-programming-guide, the deprecated Python DataFrame API inferSchema() should be replaced by createDataFrame():
      
      schemaPeople = sqlContext.inferSchema(people) ->
      schemaPeople = sqlContext.createDataFrame(people)
      
      Author: gchen <chenguancheng@gmail.com>
      
      Closes #6041 from gchen/python-docs and squashes the following commits:
      
      c27eb7c [gchen] replace inferSchema() with createDataFrame()
      8e674331
    • Kousuke Saruta's avatar
      [SPARK-7515] [DOC] Update documentation for PySpark on YARN with cluster mode · 6e9910c2
      Kousuke Saruta authored
Now that PySpark on YARN with cluster mode is supported, let's update the doc.
      
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #6040 from sarutak/update-doc-for-pyspark-on-yarn and squashes the following commits:
      
      ad9f88c [Kousuke Saruta] Brushed up sentences
      469fd2e [Kousuke Saruta] Merge branch 'master' of https://github.com/apache/spark into update-doc-for-pyspark-on-yarn
      fcfdb92 [Kousuke Saruta] Updated doc for PySpark on YARN with cluster mode
      6e9910c2
    • Steve Loughran's avatar
      [SPARK-7508] JettyUtils-generated servlets to log & report all errors · 7ce2a33c
      Steve Loughran authored
      Patch for SPARK-7508
      
This logs a warning, then generates a response that includes the message body and stack trace as text/plain, no-cache. The status code is 500.
      
In practice (in some tests in SPARK-1537, to be precise), Jetty gets in between this servlet and the web response the user sees: the body of the response is lost for any error response (500, even 404 and bad request). The standard Jetty handlers must be getting in the way.
      
This patch doesn't address that; it ensures that
1. if the Jetty handlers were put to one side, users would see the errors, and
2. at least the exceptions appear in the server-side logs.
      
This is better than users saying "I saw a 500 error" while you have nothing in the logs to show what went wrong.
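A hedged sketch of the behavior described (not Spark's actual JettyUtils code; assumes the javax.servlet API on the classpath): log the exception server-side, then reply 500 with a plain-text, no-cache body carrying the message and stack trace.
```
import java.io.{PrintWriter, StringWriter}
import javax.servlet.http.{HttpServlet, HttpServletRequest, HttpServletResponse}

abstract class ErrorReportingServlet extends HttpServlet {
  def render(req: HttpServletRequest): String

  override def doGet(req: HttpServletRequest, resp: HttpServletResponse): Unit = {
    try {
      resp.setStatus(HttpServletResponse.SC_OK)
      resp.getWriter.print(render(req))
    } catch {
      case e: Exception =>
        log(s"Error processing ${req.getRequestURI}", e)  // server-side log
        val trace = new StringWriter()
        e.printStackTrace(new PrintWriter(trace))
        resp.setStatus(HttpServletResponse.SC_INTERNAL_SERVER_ERROR)  // 500
        resp.setContentType("text/plain;charset=utf-8")
        resp.setHeader("Cache-Control", "no-cache, no-store")
        resp.getWriter.print(s"${e.getMessage}\n$trace")  // body the user would see
    }
  }
}
```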
      
      Author: Steve Loughran <stevel@hortonworks.com>
      
      Closes #6033 from steveloughran/stevel/feature/SPARK-7508-JettyUtils and squashes the following commits:
      
      584836f [Steve Loughran] SPARK-7508 drop trailing semicolon
      ad6f185 [Steve Loughran] SPARK-7508: jetty handles exception reporting itself; spark just sets this up and logs exceptions before being relayed
      258d9f9 [Steve Loughran] SPARK-7508 fix typo manually-edited before patch pushed
      69c8263 [Steve Loughran] SPARK-7508 JettyUtils-generated servlets to log & report all errors
      7ce2a33c
    • Sandy Ryza's avatar
      [SPARK-6470] [YARN] Add support for YARN node labels. · 82fee9d9
      Sandy Ryza authored
It is difficult to write a test for this because it relies on the latest version of YARN, but I verified manually that the patch passes along the label expression on this version and that containers are successfully launched.
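For reference, the feature is driven by a configuration property (the property name follows this patch; the "gpu" label value is illustrative):
```
import org.apache.spark.SparkConf

object NodeLabelExample {
  // Restrict executors to YARN nodes carrying the "gpu" label.
  val conf = new SparkConf()
    .setAppName("labeled-job")
    .set("spark.yarn.executor.nodeLabelExpression", "gpu")
}
```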
      
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #5242 from sryza/sandy-spark-6470 and squashes the following commits:
      
      6af87b9 [Sandy Ryza] Change info to warning
      6e22d99 [Sandy Ryza] [YARN] SPARK-6470.  Add support for YARN node labels.
      82fee9d9
    • Reynold Xin's avatar
      [SPARK-7462] By default retain group by columns in aggregate · 0a4844f9
      Reynold Xin authored
      Updated Java, Scala, Python, and R.
      
      Author: Reynold Xin <rxin@databricks.com>
      Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
      
      Closes #5996 from rxin/groupby-retain and squashes the following commits:
      
      aac7119 [Reynold Xin] Merge branch 'groupby-retain' of github.com:rxin/spark into groupby-retain
      f6858f6 [Reynold Xin] Merge branch 'master' into groupby-retain
      5f923c0 [Reynold Xin] Merge pull request #15 from shivaram/sparkr-groupby-retrain
      c1de670 [Shivaram Venkataraman] Revert workaround in SparkR to retain grouped cols Based on reverting code added in commit https://github.com/amplab-extras/spark/commit/9a6be746efc9fafad88122fa2267862ef87aa0e1
      b8b87e1 [Reynold Xin] Fixed DataFrameJoinSuite.
      d910141 [Reynold Xin] Updated rest of the files
      1e6e666 [Reynold Xin] [SPARK-7462] By default retain group by columns in aggregate
      0a4844f9
    • Tathagata Das's avatar
      [SPARK-7361] [STREAMING] Throw unambiguous exception when attempting to start... · 1b465569
      Tathagata Das authored
      [SPARK-7361] [STREAMING] Throw unambiguous exception when attempting to start multiple StreamingContexts in the same JVM
      
Currently, attempting to start a StreamingContext while another one is running throws a confusing exception saying that the action name JobScheduler is already registered. Instead, it's best to throw a proper exception, as this is not supported.
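A minimal sketch (assumed names, not Spark's exact code) of the guard this patch adds: remember the active StreamingContext and fail fast with an unambiguous message when a second one is started in the same JVM.
```
object SingleContextGuard {
  private var active: Option[AnyRef] = None

  def markStarted(ctx: AnyRef): Unit = synchronized {
    if (active.exists(_ ne ctx)) {
      throw new IllegalStateException(
        "Only one StreamingContext may be started in this JVM; " +
        "stop the running one before starting another.")
    }
    active = Some(ctx)
  }

  def markStopped(): Unit = synchronized { active = None }
}
```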
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #5907 from tdas/SPARK-7361 and squashes the following commits:
      
      fb81c4a [Tathagata Das] Fix typo
      a9cd5bb [Tathagata Das] Added startSite to StreamingContext
      5fdfc0d [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into SPARK-7361
      5870e2b [Tathagata Das] Added check for multiple streaming contexts
      1b465569
    • Bryan Cutler's avatar
      [SPARK-7522] [EXAMPLES] Removed angle brackets from dataFormat option · 4f8a1551
      Bryan Cutler authored
As is, to specify this option on the command line, you have to escape the angle brackets.
      
      Author: Bryan Cutler <bjcutler@us.ibm.com>
      
      Closes #6049 from BryanCutler/dataFormat-option-7522 and squashes the following commits:
      
      b34afb4 [Bryan Cutler] [SPARK-7522] Removed angle brackets from dataFormat option
      4f8a1551
    • Yanbo Liang's avatar
      [SPARK-6092] [MLLIB] Add RankingMetrics in PySpark/MLlib · 042dda3c
      Yanbo Liang authored
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #6044 from yanboliang/spark-6092 and squashes the following commits:
      
      726a9b1 [Yanbo Liang] add newRankingMetrics
      33f649c [Yanbo Liang] Add RankingMetrics in PySpark/MLlib
      042dda3c
    • Wesley Miao's avatar
      [SPARK-7326] [STREAMING] Performing window() on a WindowedDStream doesn't work all the time · d70a0768
      Wesley Miao authored
      tdas
      
      https://issues.apache.org/jira/browse/SPARK-7326
      
      The problem most likely resides in DStream.slice() implementation, as shown below.
      
        def slice(fromTime: Time, toTime: Time): Seq[RDD[T]] = {
          if (!isInitialized) {
            throw new SparkException(this + " has not been initialized")
          }
          if (!(fromTime - zeroTime).isMultipleOf(slideDuration)) {
            logWarning("fromTime (" + fromTime + ") is not a multiple of slideDuration ("
              + slideDuration + ")")
          }
          if (!(toTime - zeroTime).isMultipleOf(slideDuration)) {
            logWarning("toTime (" + fromTime + ") is not a multiple of slideDuration ("
              + slideDuration + ")")
          }
    val alignedToTime = toTime.floor(slideDuration)
    val alignedFromTime = fromTime.floor(slideDuration)
      
          logInfo("Slicing from " + fromTime + " to " + toTime +
            " (aligned to " + alignedFromTime + " and " + alignedToTime + ")")
      
          alignedFromTime.to(alignedToTime, slideDuration).flatMap(time => {
            if (time >= zeroTime) getOrCompute(time) else None
          })
        }
      
Here, after performing floor() on both fromTime and toTime, the results (alignedFromTime - zeroTime) and (alignedToTime - zeroTime) may no longer be multiples of the slideDuration, making the isTimeValid() check fail for all the remaining computation. For example, with zeroTime = 1000 ms, slideDuration = 400 ms, and toTime = 2300 ms, the old floor yields 2000 ms, and 2000 - 1000 = 1000 is not a multiple of 400.
      
The fix is to add a new floor() function in Time.scala that respects the zeroTime while performing the floor:
      
        def floor(that: Duration, zeroTime: Time): Time = {
          val t = that.milliseconds
          new Time(((this.millis - zeroTime.milliseconds) / t) * t + zeroTime.milliseconds)
        }
      
And then change DStream.slice to call this new floor function, passing in its zeroTime.
      
          val alignedToTime = toTime.floor(slideDuration, zeroTime)
          val alignedFromTime = fromTime.floor(slideDuration, zeroTime)
      
This way alignedToTime and alignedFromTime are *really* aligned with respect to zeroTime, whose value is not necessarily 0.
      
      Author: Wesley Miao <wesley.miao@gmail.com>
      Author: Wesley <wesley.miao@autodesk.com>
      
      Closes #5871 from wesleymiao/spark-7326 and squashes the following commits:
      
      82a4d8c [Wesley Miao] [SPARK-7326] [STREAMING] Performing window() on a WindowedDStream dosen't work all the time
      48b4dc0 [Wesley] [SPARK-7326] [STREAMING] Performing window() on a WindowedDStream doesn't work all the time
      6ade399 [Wesley] [SPARK-7326] [STREAMING] Performing window() on a WindowedDStream doesn't work all the time
      2611745 [Wesley Miao] [SPARK-7326] [STREAMING] Performing window() on a WindowedDStream doesn't work all the time
      d70a0768
    • tianyi's avatar
      [SPARK-7519] [SQL] fix minor bugs in thrift server UI · 2242ab31
      tianyi authored
Bug descriptions:

1. There are extra commas at the top of the session list.
2. The time format in the "Start at:" field differs from the others.
3. The total number of online sessions is wrong.
      
      Author: tianyi <tianyi.asiainfo@gmail.com>
      
      Closes #6048 from tianyi/SPARK-7519 and squashes the following commits:
      
      ed366b7 [tianyi] fix bug
      2242ab31
  2. May 10, 2015
    • Shivaram Venkataraman's avatar
      [SPARK-7512] [SPARKR] Fix RDD's show method to use getJRDD · 0835f1ed
      Shivaram Venkataraman authored
Since the RDD object might be a PipelinedRDD, we should use `getJRDD` to get the right handle to the Java object.
      
      Fixes the bug reported at
      http://stackoverflow.com/questions/30057702/sparkr-filterrdd-and-flatmap-not-working
      
      cc concretevitamin
      
      Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
      
      Closes #6035 from shivaram/sparkr-show-bug and squashes the following commits:
      
      d70145c [Shivaram Venkataraman] Fix RDD's show method to use getJRDD Fixes the bug reported at http://stackoverflow.com/questions/30057702/sparkr-filterrdd-and-flatmap-not-working
      0835f1ed
    • Glenn Weidner's avatar
      [SPARK-7427] [PYSPARK] Make sharedParams match in Scala, Python · c5aca0c2
      Glenn Weidner authored
      Modified 2 files:
      python/pyspark/ml/param/_shared_params_code_gen.py
      python/pyspark/ml/param/shared.py
      
Generated shared.py on Linux using Python 2.6.6 on Red Hat Enterprise Linux Server 6.6:
      python _shared_params_code_gen.py > shared.py
      
Only changed maxIter, regParam, and rawPredictionCol based on strings from SharedParamsCodeGen.scala. Note that a warning was displayed when committing shared.py:
      warning: LF will be replaced by CRLF in python/pyspark/ml/param/shared.py.
      
      Author: Glenn Weidner <gweidner@us.ibm.com>
      
      Closes #6023 from gweidner/br-7427 and squashes the following commits:
      
      db72e32 [Glenn Weidner] [SPARK-7427] [PySpark] Make sharedParams match in Scala, Python
      825e4a9 [Glenn Weidner] [SPARK-7427] [PySpark] Make sharedParams match in Scala, Python
      e6a865e [Glenn Weidner] [SPARK-7427] [PySpark] Make sharedParams match in Scala, Python
      1eee702 [Glenn Weidner] Merge remote-tracking branch 'upstream/master'
      1ac10e5 [Glenn Weidner] Merge remote-tracking branch 'upstream/master'
      cafd104 [Glenn Weidner] Merge remote-tracking branch 'upstream/master'
      9bea1eb [Glenn Weidner] Merge remote-tracking branch 'upstream/master'
      4a35c20 [Glenn Weidner] Merge remote-tracking branch 'upstream/master'
      9790cbe [Glenn Weidner] Merge remote-tracking branch 'upstream/master'
      d9c30f4 [Glenn Weidner] [SPARK-7275] [SQL] [WIP] Make LogicalRelation public
      c5aca0c2
    • Kirill A. Korinskiy's avatar
      [SPARK-5521] PCA wrapper for easy transform vectors · 8c07c75c
      Kirill A. Korinskiy authored
I implemented a simple PCA wrapper to easily transform vectors with PCA, for example inside a LabeledPoint or another complex structure.
      
      Example of usage:
      ```
        import org.apache.spark.mllib.regression.LinearRegressionWithSGD
        import org.apache.spark.mllib.regression.LabeledPoint
        import org.apache.spark.mllib.linalg.Vectors
        import org.apache.spark.mllib.feature.PCA
      
        val data = sc.textFile("data/mllib/ridge-data/lpsa.data").map { line =>
          val parts = line.split(',')
          LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
        }.cache()
      
        val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
        val training = splits(0).cache()
        val test = splits(1)
      
        val pca = PCA.create(training.first().features.size/2, data.map(_.features))
        val training_pca = training.map(p => p.copy(features = pca.transform(p.features)))
        val test_pca = test.map(p => p.copy(features = pca.transform(p.features)))
      
        val numIterations = 100
        val model = LinearRegressionWithSGD.train(training, numIterations)
        val model_pca = LinearRegressionWithSGD.train(training_pca, numIterations)
      
        val valuesAndPreds = test.map { point =>
          val score = model.predict(point.features)
          (score, point.label)
        }
      
        val valuesAndPreds_pca = test_pca.map { point =>
          val score = model_pca.predict(point.features)
          (score, point.label)
        }
      
        val MSE = valuesAndPreds.map{case(v, p) => math.pow((v - p), 2)}.mean()
        val MSE_pca = valuesAndPreds_pca.map{case(v, p) => math.pow((v - p), 2)}.mean()
      
        println("Mean Squared Error = " + MSE)
        println("PCA Mean Squared Error = " + MSE_pca)
      ```
      
      Author: Kirill A. Korinskiy <catap@catap.ru>
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #4304 from catap/pca and squashes the following commits:
      
      501bcd9 [Joseph K. Bradley] Small updates: removed k from Java-friendly PCA fit().  In PCASuite, converted results to set for comparison. Added an error message for bad k in PCA.
      9dcc02b [Kirill A. Korinskiy] [SPARK-5521] fix scala style
      1892a06 [Kirill A. Korinskiy] [SPARK-5521] PCA wrapper for easy transform vectors
      8c07c75c
    • Joseph K. Bradley's avatar
      [SPARK-7431] [ML] [PYTHON] Made CrossValidatorModel call parent init in PySpark · 3038443e
      Joseph K. Bradley authored
      Fixes bug with PySpark cvModel not having UID
      Also made small PySpark fixes: Evaluator should inherit from Params.  MockModel should inherit from Model.
      
      CC: mengxr
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #5968 from jkbradley/pyspark-cv-uid and squashes the following commits:
      
      57f13cd [Joseph K. Bradley] Made CrossValidatorModel call parent init in PySpark
      3038443e
    • Cheng Lian's avatar
      [MINOR] [SQL] Fixes variable name typo · 6bf9352f
      Cheng Lian authored
      <!-- Reviewable:start -->
      [<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/6038)
      <!-- Reviewable:end -->
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #6038 from liancheng/fix-typo and squashes the following commits:
      
      572c2a4 [Cheng Lian] Fixes variable name typo
      6bf9352f
    • Oleg Sidorkin's avatar
      [SPARK-7345][SQL] Spark cannot detect renamed columns using JDBC connector · d7a37bca
      Oleg Sidorkin authored
The issue appears when one tries to create a DataFrame using a sqlContext.load("jdbc"...) statement where "dbtable" contains a query with renamed columns.
If the original column is used in the SQL query once, the resulting DataFrame will contain the non-renamed column.
If the original column is used in the SQL query several times with different aliases, sqlContext.load will fail.
The original implementation of JDBCRDD.resolveTable uses getColumnName to detect column names in the RDD schema.
The suggested implementation uses getColumnLabel, which honors the SQL "AS" clause and therefore handles column renames in the SQL statement.
      
      Readings:
      http://stackoverflow.com/questions/4271152/getcolumnlabel-vs-getcolumnname
      http://stackoverflow.com/questions/12259829/jdbc-getcolumnname-getcolumnlabel-db2
      
The official documentation is unfortunately a bit misleading in its definition of the "suggested title" purpose, but it clearly defines the behavior of the AS keyword in SQL statements.
      http://docs.oracle.com/javase/7/docs/api/java/sql/ResultSetMetaData.html
      getColumnLabel - Gets the designated column's suggested title for use in printouts and displays. The suggested title is usually specified by the SQL AS clause. If a SQL AS is not specified, the value returned from getColumnLabel will be the same as the value returned by the getColumnName method.
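A small sketch of the difference the patch relies on (H2 is just an assumption for the demo; any JDBC source behaves the same): for `SELECT a AS b`, getColumnName reports the underlying column while getColumnLabel honors the AS alias.
```
import java.sql.DriverManager

object AliasMetadataDemo extends App {
  val conn = DriverManager.getConnection("jdbc:h2:mem:demo")
  val st = conn.createStatement()
  st.execute("CREATE TABLE t(a INT)")
  st.execute("INSERT INTO t VALUES (1)")
  val md = st.executeQuery("SELECT a AS b FROM t").getMetaData
  println(md.getColumnName(1))   // "A" - the physical column (H2 uppercases names)
  println(md.getColumnLabel(1))  // "B" - the AS alias, i.e. the "suggested title"
  conn.close()
}
```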
      
      Author: Oleg Sidorkin <oleg.sidorkin@gmail.com>
      
      Closes #6032 from osidorkin/master and squashes the following commits:
      
      10fc44b [Oleg Sidorkin] [SPARK-7345][SQL] Regression test for JDBCSuite (resolved scala style test error)
      2aaf6f7 [Oleg Sidorkin] [SPARK-7345][SQL] Regression test for JDBCSuite (renamed fields in JDBC query)
      b7d5b22 [Oleg Sidorkin] [SPARK-7345][SQL] Regression test for JDBCSuite
      09559a0 [Oleg Sidorkin] [SPARK-7345][SQL] Spark cannot detect renamed columns using JDBC connector
      d7a37bca
    • Yanbo Liang's avatar
      [SPARK-6091] [MLLIB] Add MulticlassMetrics in PySpark/MLlib · bf7e81a5
      Yanbo Liang authored
      https://issues.apache.org/jira/browse/SPARK-6091
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #6011 from yanboliang/spark-6091 and squashes the following commits:
      
      bb3e4ba [Yanbo Liang] trigger jenkins
      53c045d [Yanbo Liang] keep compatibility for python 2.6
      972d5ac [Yanbo Liang] Add MulticlassMetrics in PySpark/MLlib
      bf7e81a5
  3. May 09, 2015
    • Yuhao Yang's avatar
      [SPARK-7475] [MLLIB] adjust ldaExample for online LDA · b13162b3
      Yuhao Yang authored
      jira: https://issues.apache.org/jira/browse/SPARK-7475
      
      Add a new argument to specify the algorithm applied to LDA, to exhibit the basic usage of LDAOptimizer.
      
      cc jkbradley
      
      Author: Yuhao Yang <hhbyyh@gmail.com>
      
      Closes #6000 from hhbyyh/ldaExample and squashes the following commits:
      
      0a7e2bc [Yuhao Yang] fix according to comments
      5810b0f [Yuhao Yang] adjust ldaExample for online LDA
      b13162b3
    • tedyu's avatar
      [BUILD] Reference fasterxml.jackson.version in sql/core/pom.xml · bd74301f
      tedyu authored
      Author: tedyu <yuzhihong@gmail.com>
      
      Closes #6031 from tedyu/master and squashes the following commits:
      
      5c2580c [tedyu] Reference fasterxml.jackson.version in sql/core/pom.xml
      ff2a44f [tedyu] Merge branch 'master' of github.com:apache/spark
      28c8394 [tedyu] Upgrade version of jackson-databind in sql/core/pom.xml
      bd74301f
    • tedyu's avatar
      Upgrade version of jackson-databind in sql/core/pom.xml · 3071aac3
      tedyu authored
Currently the version of jackson-databind in sql/core/pom.xml is 2.3.0, which is older than the version specified in the root pom.xml.

This PR upgrades the version in sql/core/pom.xml so that they're consistent.
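The idea, sketched as a pom fragment (the property name comes from the commit title; the coordinates are Jackson's standard ones): reference the shared property instead of hard-coding a version.
```
<dependency>
  <groupId>com.fasterxml.jackson.core</groupId>
  <artifactId>jackson-databind</artifactId>
  <version>${fasterxml.jackson.version}</version>
</dependency>
```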
      
      Author: tedyu <yuzhihong@gmail.com>
      
      Closes #6028 from tedyu/master and squashes the following commits:
      
      28c8394 [tedyu] Upgrade version of jackson-databind in sql/core/pom.xml
      3071aac3
    • dobashim's avatar
      [STREAMING] [DOCS] Fix wrong url about API docs of StreamingListener · 7d0f1720
      dobashim authored
A little fix for a wrong URL in the API document. (org.apache.spark.streaming.scheduler.StreamingListener)
      
      Author: dobashim <dobashim@oss.nttdata.co.jp>
      
      Closes #6024 from dobashim/master and squashes the following commits:
      
      ac9a955 [dobashim] [STREAMING][DOCS] Fix wrong url about API docs of StreamingListener
      7d0f1720
    • Kousuke Saruta's avatar
      [SPARK-7403] [WEBUI] Link URL in objects on Timeline View is wrong in case of running on YARN · 12b95abc
      Kousuke Saruta authored
When we use Spark on YARN and access AllJobPage via the ResourceManager's proxy, the link URL in the objects representing each job on the timeline view is wrong.
      
      In timeline-view.js, the link is generated as follows.
      ```
      window.location.href = "job/?id=" + getJobId(this);
      ```
      
This assumes the URL displayed in the web browser ends with "jobs/", but when we access AllJobPage via the proxy, the URL displayed does not end with "jobs/".

The proxy doesn't return status code 301 or 302, so the URL displayed still indicates the base URL, not "/jobs", even though AllJobPage is being displayed.
      
      ![2015-05-07 3 34 37](https://cloud.githubusercontent.com/assets/4736016/7501079/a8507ad6-f46c-11e4-9bed-62abea170f4c.png)
      
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #5947 from sarutak/fix-link-in-timeline and squashes the following commits:
      
      aaf40e1 [Kousuke Saruta] Added Copyright for vis.js
      01bee7b [Kousuke Saruta] Fixed timeline-view.js in order to get correct href
      12b95abc
    • Vinod K C's avatar
      [SPARK-7438] [SPARK CORE] Fixed validation of relativeSD in countApproxDistinct · dda6d9f4
      Vinod K C authored
      Author: Vinod K C <vinod.kc@huawei.com>
      
      Closes #5974 from vinodkc/fix_countApproxDistinct_Validation and squashes the following commits:
      
      3a3d59c [Vinod K C] Reverted removal of validation relativeSD<0.000017
      799976e [Vinod K C] Removed testcase to assert IAE when relativeSD>3.7
      8ddbfae [Vinod K C] Remove blank line
      b1b00a3 [Vinod K C] Removed relativeSD validation from python API,RDD.scala will do validation
      122d378 [Vinod K C] Fixed validation of relativeSD in  countApproxDistinct
      dda6d9f4
  4. May 08, 2015
    • Joseph K. Bradley's avatar
      [SPARK-7498] [ML] removed varargs annotation from Params.setDefaults · 29926238
      Joseph K. Bradley authored
In SPARK-7429 and PR https://github.com/apache/spark/pull/5960, I added the varargs annotation to Params.setDefault, which takes a variable number of ParamPairs. It worked locally and on Jenkins for me.
However, mengxr reported issues compiling on his machine, so I'm reverting the change introduced in https://github.com/apache/spark/pull/5960 by removing varargs.
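For context, a small illustration of the annotation at issue (the class and method below are illustrative, not Spark's Params): `@varargs` makes a Scala vararg method callable as a Java varargs method by generating a bridge method, and that bridge is what tripped some compilers.
```
import scala.annotation.varargs

class Defaults {
  // The bridge method generated for Java callers is the reverted part.
  @varargs
  def setDefault(pairs: (String, Any)*): this.type = {
    pairs.foreach { case (name, value) => println(s"$name -> $value") }
    this
  }
}
```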
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #6021 from jkbradley/revert-varargs and squashes the following commits:
      
      098ed39 [Joseph K. Bradley] removed varargs annotation from Params.setDefaults taking multiple ParamPairs
      29926238
    • DB Tsai's avatar
      [SPARK-7262] [ML] Binary LogisticRegression with L1/L2 (elastic net) using OWLQN in new ML package · 86ef4cfd
      DB Tsai authored
      1) Handle scaling and addBias internally.
      2) L1/L2 elasticnet using OWLQN optimizer.
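For reference (standard formulation, not quoted from the patch), the elastic net penalty minimized alongside the logistic loss interpolates between ridge (alpha = 0) and lasso (alpha = 1):
```
R(\mathbf{w}) = \lambda \left( \alpha \|\mathbf{w}\|_1 + \frac{1 - \alpha}{2} \|\mathbf{w}\|_2^2 \right),
\qquad \alpha \in [0, 1]
```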
      
      Author: DB Tsai <dbt@netflix.com>
      
      Closes #5967 from dbtsai/lor and squashes the following commits:
      
      fa029bb [DB Tsai] made the bound smaller
      0806002 [DB Tsai] better initial intercept and more test
      5c31824 [DB Tsai] fix import
      c387e25 [DB Tsai] Merge branch 'master' into lor
      c84e931 [DB Tsai] Made MultiClassSummarizer private
      f98e711 [DB Tsai] address feedback
      a784321 [DB Tsai] fix style
      8ec65d2 [DB Tsai] remove new line
      f3f8c88 [DB Tsai] add more tests and they match R which is good. fix a bug
      34705bc [DB Tsai] first commit
      86ef4cfd
    • Josh Rosen's avatar
      [SPARK-7375] [SQL] Avoid row copying in exchange when sort.serializeMapOutputs takes effect · cde54838
      Josh Rosen authored
This patch refactors the SQL `Exchange` operator's logic for determining whether map outputs need to be copied before being shuffled. As part of this change, we'll now avoid unnecessary copies in cases where sort-based shuffle operates on serialized map outputs (as in #4450 / SPARK-4550).
      
      This patch also includes a change to copy the input to RangePartitioner partition bounds calculation, which is necessary because this calculation buffers mutable Java objects.
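A tiny standalone demonstration of why buffering mutable objects forces a defensive copy (the StringBuilder stands in for SQL's reused mutable rows; RangePartitioner's bounds sampling is the buffering party):
```
object MutableBufferingDemo extends App {
  val reused = new StringBuilder
  val buffered = (1 to 3).map { i =>
    reused.clear()
    reused.append(i.toString)
    reused                              // the same mutable object, buffered each time
  }
  println(buffered.map(_.toString))     // Vector(3, 3, 3) - earlier values are lost
  val copied = (1 to 3).map(i => new StringBuilder(i.toString))
  println(copied.map(_.toString))       // Vector(1, 2, 3) - copies preserve each value
}
```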
      
      <!-- Reviewable:start -->
      [<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/5948)
      <!-- Reviewable:end -->
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #5948 from JoshRosen/SPARK-7375 and squashes the following commits:
      
      f305ff3 [Josh Rosen] Reduce scope of some variables in Exchange
      899e1d7 [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-7375
      6a6bfce [Josh Rosen] Fix issue related to RangePartitioning:
      ad006a4 [Josh Rosen] [SPARK-7375] Avoid defensive copying in exchange operator when sort.serializeMapOutputs takes effect.
      cde54838
    • Shivaram Venkataraman's avatar
      [SPARK-7231] [SPARKR] Changes to make SparkR DataFrame dplyr friendly. · 0a901dd3
      Shivaram Venkataraman authored
      Changes include
      1. Rename sortDF to arrange
      2. Add new aliases `group_by` and `sample_frac`, `summarize`
      3. Add more user friendly column addition (mutate), rename
      4. Support mean as an alias for avg in Scala and also support n_distinct, n as in dplyr
      
      Using these changes we can pretty much run the examples as described in http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html with the same syntax
      
The only thing missing in SparkR is auto-resolving column names when used in an expression, i.e. making something like `select(flights, delay)` work as it does in dplyr; right now we need `select(flights, flights$delay)` or `select(flights, "delay")`. But this is a complicated change and I'll file a new issue for it.
      
      cc sun-rui rxin
      
      Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
      
      Closes #6005 from shivaram/sparkr-df-api and squashes the following commits:
      
      5e0716a [Shivaram Venkataraman] Fix some roxygen bugs
      1254953 [Shivaram Venkataraman] Merge branch 'master' of https://github.com/apache/spark into sparkr-df-api
      0521149 [Shivaram Venkataraman] Changes to make SparkR DataFrame dplyr friendly. Changes include 1. Rename sortDF to arrange 2. Add new aliases `group_by` and `sample_frac`, `summarize` 3. Add more user friendly column addition (mutate), rename 4. Support mean as an alias for avg in Scala and also support n_distinct, n as in dplyr
      0a901dd3
    • Ashwin Shankar's avatar
      [SPARK-7451] [YARN] Preemption of executors is counted as failure causing Spark job to fail · b6c797b0
      Ashwin Shankar authored
Added a check to handle the container exit status for the preemption scenario: log an INFO message in such cases and move on.
      andrewor14
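A sketch of the check (assuming the `ContainerExitStatus.PREEMPTED` constant from the Hadoop 2.x YARN API; the helper name is illustrative): a preempted container is a normal cluster event, not an application failure.
```
import org.apache.hadoop.yarn.api.records.ContainerExitStatus

object PreemptionCheck {
  // Preempted containers get an INFO log and do not increment the
  // executor-failure counter that can fail the whole job.
  def countsAsFailure(exitStatus: Int): Boolean =
    exitStatus != ContainerExitStatus.PREEMPTED
}
```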
      
      Author: Ashwin Shankar <ashankar@netflix.com>
      
      Closes #5993 from ashwinshankar77/SPARK-7451 and squashes the following commits:
      
      90900cf [Ashwin Shankar] Fix log info message
      cf8b6cf [Ashwin Shankar] Stop counting preemption of executors as failure
      b6c797b0
    • Burak Yavuz's avatar
      [SPARK-7488] [ML] Feature Parity in PySpark for ml.recommendation · 84bf931f
      Burak Yavuz authored
Adds a Python API for `ALS` under `ml.recommendation` in PySpark. Also adds seed as a settable parameter in the Scala implementation of ALS.
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #6015 from brkyvz/ml-rec and squashes the following commits:
      
      be6e931 [Burak Yavuz] addressed comments
      eaed879 [Burak Yavuz] readd numFeatures
      0bd66b1 [Burak Yavuz] fixed seed
      7f6d964 [Burak Yavuz] merged master
      52e2bda [Burak Yavuz] added ALS
      84bf931f
    • tedyu's avatar
      [SPARK-7237] Clean function in several RDD methods · 54e6fa05
      tedyu authored
      Author: tedyu <yuzhihong@gmail.com>
      
      Closes #5959 from ted-yu/master and squashes the following commits:
      
      f83d445 [tedyu] Move cleaning outside of mapPartitionsWithIndex
      56d7c92 [tedyu] Consolidate import of Random
      f6014c0 [tedyu] Remove cleaning in RDD#filterWith
      36feb6c [tedyu] Try to get correct syntax
      55d01eb [tedyu] Try to get correct syntax
      c2786df [tedyu] Correct syntax
      d92bfcf [tedyu] Correct syntax in test
      164d3e4 [tedyu] Correct variable name
      8b50d93 [tedyu] Address Andrew's review comments
      0c8d47e [tedyu] Add test for mapWith()
      6846e40 [tedyu] Add test for flatMapWith()
      6c124a9 [tedyu] Clean function in several RDD methods
      54e6fa05
    • Andrew Or's avatar
      [SPARK-7469] [SQL] DAG visualization: show SQL query operators · bd61f070
      Andrew Or authored
      The DAG visualization currently displays only low-level Spark primitives (e.g. `map`, `reduceByKey`, `filter` etc.). For SQL, these aren't particularly useful. Instead, we should display higher level physical operators (e.g. `Filter`, `Exchange`, `ShuffleHashJoin`). cc marmbrus
      
      -----------------
      **Before**
      <img src="https://issues.apache.org/jira/secure/attachment/12731586/before.png" width="600px"/>
      -----------------
      **After** (Pay attention to the words)
      <img src="https://issues.apache.org/jira/secure/attachment/12731587/after.png" width="600px"/>
      -----------------
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #5999 from andrewor14/dag-viz-sql and squashes the following commits:
      
      0db23a4 [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-sql
      1e211db [Andrew Or] Update comment
      0d49fd6 [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-sql
      ffd237a [Andrew Or] Fix style
      202dac1 [Andrew Or] Make ignoreParent false by default
      e61b1ab [Andrew Or] Visualize SQL operators, not low-level Spark primitives
      569034a [Andrew Or] Add a flag to ignore parent settings and scopes
      bd61f070
    • Aaron Davidson's avatar
      [SPARK-6955] Perform port retries at NettyBlockTransferService level · ffdc40ce
      Aaron Davidson authored
Currently we're doing port retries at the TransportServer level, but this is not specified by the TransportContext API, and it has other far-reaching impacts, like causing undesirable behavior for the YARN and Standalone shuffle services.
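A generic sketch of the port-retry idea being moved up a level (standalone code, not Spark's NettyBlockTransferService): try successive ports at the service layer until one binds.
```
import java.net.{BindException, ServerSocket}

object PortRetry {
  // Try startPort, startPort + 1, ... until one binds or retries run out.
  def startWithRetries(startPort: Int, maxRetries: Int): ServerSocket = {
    for (attempt <- 0 to maxRetries) {
      try {
        return new ServerSocket(startPort + attempt)
      } catch {
        case _: BindException => // port in use; try the next one
      }
    }
    throw new BindException(
      s"Could not bind a port in [$startPort, ${startPort + maxRetries}]")
  }
}
```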
      
      Author: Aaron Davidson <aaron@databricks.com>
      
      Closes #5575 from aarondav/port-bind and squashes the following commits:
      
      3c2d6ed [Aaron Davidson] Oops, never do it.
      a5d9432 [Aaron Davidson] Remove shouldHostShuffleServiceIfEnabled
      e901eb2 [Aaron Davidson] fix local-cluster mode for ExternalShuffleServiceSuite
      59e5e38 [Aaron Davidson] [SPARK-6955] Perform port retries at NettyBlockTransferService level
      ffdc40ce
    • Brendan Collins's avatar
      updated ec2 instance types · 1c78f686
      Brendan Collins authored
I needed to run some d2 instances, so I updated spark_ec2.py accordingly.
      
      Author: Brendan Collins <bcollins@blueraster.com>
      
      Closes #6014 from brendancol/ec2-instance-types-update and squashes the following commits:
      
      d7b4191 [Brendan Collins] Merge branch 'ec2-instance-types-update' of github.com:brendancol/spark into ec2-instance-types-update
      6366c45 [Brendan Collins] added back cc1.4xlarge
      fc2931f [Brendan Collins] updated ec2 instance types
      80c2aa6 [Brendan Collins] vertically aligned whitespace
      85c6236 [Brendan Collins] vertically aligned whitespace
      1657c26 [Brendan Collins] updated ec2 instance types
      1c78f686
    • Yanbo Liang's avatar
      [SPARK-5913] [MLLIB] Python API for ChiSqSelector · 35c9599b
      Yanbo Liang authored
      Add a Python API for mllib.feature.ChiSqSelector
      https://issues.apache.org/jira/browse/SPARK-5913
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #5939 from yanboliang/spark-5913 and squashes the following commits:
      
      cdaac99 [Yanbo Liang] Python API for ChiSqSelector
      35c9599b
    • Jacky Li's avatar
      [SPARK-4699] [SQL] Make caseSensitive configurable in spark sql analyzer · 6dad76e5
      Jacky Li authored
      based on #3558
      
      Author: Jacky Li <jacky.likun@huawei.com>
      Author: wangfei <wangfei1@huawei.com>
      Author: scwf <wangfei1@huawei.com>
      
      Closes #5806 from scwf/case and squashes the following commits:
      
      cd51712 [wangfei] fix compile
      d4b724f [wangfei] address michael's comment
      af512c7 [wangfei] fix conflicts
      4ef1be7 [wangfei] fix conflicts
      269cf21 [scwf] fix conflicts
      b73df6c [scwf] style issue
      9e11752 [scwf] improve SimpleCatalystConf
      b35529e [scwf] minor style
      a3f7659 [scwf] remove unsed imports
      2a56515 [scwf] fix conflicts
      6db4bf5 [scwf] also fix for HiveContext
      7fc4a98 [scwf] fix test case
      d5a9933 [wangfei] fix style
      eee75ba [wangfei] fix EmptyConf
      6ef31cf [wangfei] revert pom changes
      5d7c456 [wangfei] set CASE_SENSITIVE false in TestHive
      966e719 [wangfei] set CASE_SENSITIVE false in hivecontext
      fd30e25 [wangfei] added override
      69b3b70 [wangfei] fix AnalysisSuite
      5472b08 [wangfei] fix compile issue
      56034ca [wangfei] fix conflicts and improve for catalystconf
      664d1e9 [Jacky Li] Merge branch 'master' of https://github.com/apache/spark into case
      12eca9a [Jacky Li] solve conflict with master
      39e369c [Jacky Li] fix confilct after DataFrame PR
      dee56e9 [Jacky Li] fix test case failure
      05b09a3 [Jacky Li] fix conflict base on the latest master branch
      73c16b1 [Jacky Li] fix bug in sql/hive
      9bf4cc7 [Jacky Li] fix bug in catalyst
      005c56d [Jacky Li] make SQLContext caseSensitivity configurable
      6332e0f [Jacky Li] fix bug
      fcbf0d9 [Jacky Li] fix scalastyle check
      e7bca31 [Jacky Li] make caseSensitive configuration in Analyzer and Catalog
      91b1b96 [Jacky Li] make caseSensitive configurable in Analyzer
      f57f15c [Jacky Li] add testcase
      578d167 [Jacky Li] make caseSensitive configurable
      6dad76e5
    • Liang-Chi Hsieh's avatar
      [SPARK-7390] [SQL] Only merge other CovarianceCounter when its count is greater than zero · 90527f56
      Liang-Chi Hsieh authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-7390
      
      Also fix a minor typo.
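A standalone sketch of the guard (not Spark's exact CovarianceCounter): an online covariance accumulator whose merge skips empty counterparts, since merging two empty counters would otherwise divide 0 by 0 when updating the means.
```
class CovCounter {
  var count = 0L
  var meanX, meanY, cxy = 0.0

  def add(x: Double, y: Double): this.type = {
    count += 1
    val dx = x - meanX
    meanX += dx / count
    meanY += (y - meanY) / count
    cxy += dx * (y - meanY)          // uses the *updated* meanY (Welford-style)
    this
  }

  def merge(other: CovCounter): this.type = {
    if (other.count > 0) {           // the guard this patch adds
      val n = count + other.count
      val dx = other.meanX - meanX
      val dy = other.meanY - meanY
      cxy += other.cxy + dx * dy * count / n * other.count
      meanX += dx * other.count / n
      meanY += dy * other.count / n
      count = n
    }
    this
  }

  def covariance: Double = cxy / (count - 1)  // sample covariance; needs count > 1
}
```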
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #5931 from viirya/fix_covariancecounter and squashes the following commits:
      
      352eda6 [Liang-Chi Hsieh] Only merge other CovarianceCounter when its count is greater than zero.
      90527f56
    • Marcelo Vanzin's avatar
      [SPARK-7378] [CORE] Handle deep links to unloaded apps. · 5467c34c
      Marcelo Vanzin authored
      The code was treating deep links as if they were attempt IDs, so
      for example if you tried to load "/history/app1/jobs" directly,
      that would fail because the code would treat "jobs" as an attempt id.
      
      This change modifies the code to try both cases - first without an
      attempt id, then with it, so that deep links are handled correctly.
This assumes that the links in the Spark UI do not clash with the
attempt id namespace, though; that holds for YARN at least, which
is the only backend that currently publishes attempt IDs.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #5922 from vanzin/SPARK-7378 and squashes the following commits:
      
      96f648b [Marcelo Vanzin] Fix comparison.
      ed3bcd4 [Marcelo Vanzin] Merge branch 'master' into SPARK-7378
      23483e4 [Marcelo Vanzin] Fat fingers.
      b728f08 [Marcelo Vanzin] [SPARK-7378] [core] Handle deep links to unloaded apps.
      5467c34c
    • Marcelo Vanzin's avatar
      [MINOR] [CORE] Allow History Server to read kerberos opts from config file. · 9042f8f3
      Marcelo Vanzin authored
      Order of initialization code was wrong.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #5998 from vanzin/hs-conf-fix and squashes the following commits:
      
      00b6b6b [Marcelo Vanzin] [minor] [core] Allow History Server to read kerberos opts from config file.
      9042f8f3
    • Andrew Or's avatar
      [SPARK-7466] DAG visualization: fix orphan nodes · 3b0c5e71
      Andrew Or authored
      Simple fix. We were comparing an option with `null`.
      
      Before:
      <img src="https://issues.apache.org/jira/secure/attachment/12731383/before.png" width="250px"/>
      After:
      <img src="https://issues.apache.org/jira/secure/attachment/12731384/after.png" width="250px"/>
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #6002 from andrewor14/dag-viz-orphan-nodes and squashes the following commits:
      
      a1468dc [Andrew Or] Fix null check
      3b0c5e71