  1. Feb 03, 2015
    • Daoyuan Wang's avatar
      [SPARK-4987] [SQL] parquet timestamp type support · 0c20ce69
      Daoyuan Wang authored
      Author: Daoyuan Wang <daoyuan.wang@intel.com>
      
      Closes #3820 from adrian-wang/parquettimestamp and squashes the following commits:
      
      b1e2a0d [Daoyuan Wang] fix for nanos
      4dadef1 [Daoyuan Wang] fix wrong read
      93f438d [Daoyuan Wang] parquet timestamp support
      0c20ce69
    • Reynold Xin's avatar
      [SQL] DataFrame API update · 4204a127
      Reynold Xin authored
1. Added Java-friendly versions of the expression operators (e.g., gt, geq)
      2. Added JavaDoc for most operators
      3. Simplified expression operators by having only one version of the function (that accepts Any). Previously we had two methods for each expression operator, one accepting Any and another accepting Column.
      4. agg function now accepts varargs of (String, String).
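      A minimal sketch of the operators described above, assuming a Spark 1.3-era DataFrame `df` with hypothetical columns "dept", "age", and "salary":
      
      ```scala
      // Java-friendly comparison method on a Column (no Scala operator symbols needed).
      val adults = df.filter(df("age").geq(18))
      
      // agg taking varargs of (String, String): column name -> aggregate function name.
      val summary = df.groupBy("dept").agg("age" -> "max", "salary" -> "avg")
      ```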
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #4332 from rxin/df-update and squashes the following commits:
      
      ab0aa69 [Reynold Xin] Added Java friendly expression methods. Added JavaDoc. For each expression operator, have only one version of the function (that accepts Any). Previously we had two methods for each expression operator, one accepting Any and another accepting Column.
      576d07a [Reynold Xin] random commit.
      4204a127
    • Reynold Xin's avatar
      Minor: Fix TaskContext deprecated annotations. · f7948f3f
      Reynold Xin authored
      Made a mistake in https://github.com/apache/spark/pull/4324
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #4333 from rxin/taskcontext-deprecate and squashes the following commits:
      
      61c44ee [Reynold Xin] Minor: Fix TaskContext deprecated annotations.
      f7948f3f
    • Reynold Xin's avatar
      [SPARK-5549] Define TaskContext interface in Scala. · bebf4c42
      Reynold Xin authored
      So the interface documentation shows up in ScalaDoc.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #4324 from rxin/TaskContext-scala and squashes the following commits:
      
      2480a17 [Reynold Xin] comment
      573756f [Reynold Xin] style fixes and javadoc fixes.
      87dd537 [Reynold Xin] [SPARK-5549] Define TaskContext interface in Scala.
      bebf4c42
    • Reynold Xin's avatar
      [SPARK-5551][SQL] Create type alias for SchemaRDD for source backward compatibility · 523a9352
      Reynold Xin authored
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #4327 from rxin/schemarddTypeAlias and squashes the following commits:
      
      e5a8ff3 [Reynold Xin] [SPARK-5551][SQL] Create type alias for SchemaRDD for source backward compatibility
      523a9352
    • Reynold Xin's avatar
      [SQL][DataFrame] Remove DataFrameApi, ExpressionApi, and GroupedDataFrameApi · 37df3301
      Reynold Xin authored
      They were there mostly for code review and to make the API easier to check. I don't think they need to be there anymore.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #4328 from rxin/remove-df-api and squashes the following commits:
      
      723d600 [Reynold Xin] [SQL][DataFrame] Remove DataFrameApi and ColumnApi.
      37df3301
    • Xiangrui Meng's avatar
      [minor] update streaming linear algorithms · 659329f9
      Xiangrui Meng authored
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #4329 from mengxr/streaming-lr and squashes the following commits:
      
      78731e1 [Xiangrui Meng] update streaming linear algorithms
      659329f9
    • Joseph K. Bradley's avatar
      [SPARK-1405] [mllib] Latent Dirichlet Allocation (LDA) using EM · 980764f3
      Joseph K. Bradley authored
      **This PR introduces an API + simple implementation for Latent Dirichlet Allocation (LDA).**
      
      The [design doc for this PR](https://docs.google.com/document/d/1kSsDqTeZMEB94Bs4GTd0mvdAmduvZSSkpoSfn-seAzo) has been updated since I initially posted it.  In particular, see the API and Planning for the Future sections.
      
      * Settle on a public API which may eventually include:
        * more inference algorithms
        * more options / functionality
      * Have an initial easy-to-understand implementation which others may improve.
      * This is NOT intended to support every topic model out there.  However, if there are suggestions for making this extensible or pluggable in the future, that could be nice, as long as it does not complicate the API or implementation too much.
      * This may not be very scalable currently.  It will be important to check and improve accuracy.  For correctness of the implementation, please check against the Asuncion et al. (2009) paper in the design doc.
      
      **Dependency: This makes MLlib depend on GraphX.**
      
      Files and classes:
      * LDA.scala (441 lines):
        * class LDA (main estimator class)
        * LDA.Document  (text + document ID)
      * LDAModel.scala (266 lines)
        * abstract class LDAModel
        * class LocalLDAModel
        * class DistributedLDAModel
      * LDAExample.scala (245 lines): script to run LDA + a simple (private) Tokenizer
      * LDASuite.scala (144 lines)
      
      Data/model representation and algorithm:
      * Data/model: Uses GraphX, with term vertices + document vertices
      * Algorithm: EM, following [Asuncion, Welling, Smyth, and Teh.  "On Smoothing and Inference for Topic Models."  UAI, 2009.](http://arxiv-web3.library.cornell.edu/abs/1205.2662v1)
      * For more details, please see the description in the “DEVELOPERS NOTE” in LDA.scala
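      
      As a rough usage sketch of the API described above (`sc` is an existing SparkContext; the tiny corpus is made up, with each document given as a (docId, termCountVector) pair):
      
      ```scala
      import org.apache.spark.mllib.clustering.LDA
      import org.apache.spark.mllib.linalg.Vectors
      
      // Two fake documents over a three-term vocabulary.
      val corpus = sc.parallelize(Seq(
        (0L, Vectors.dense(1.0, 2.0, 0.0)),
        (1L, Vectors.dense(0.0, 3.0, 1.0))))
      
      val ldaModel = new LDA().setK(2).setMaxIterations(20).run(corpus)
      val topics = ldaModel.describeTopics(maxTermsPerTopic = 3)
      ```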
      
      Please refer to the JIRA for more discussion + the [design doc for this PR](https://docs.google.com/document/d/1kSsDqTeZMEB94Bs4GTd0mvdAmduvZSSkpoSfn-seAzo)
      
      Here, I list the main changes AFTER the design doc was posted.
      
      Design decisions:
      * logLikelihood() computes the log likelihood of the data and the current point estimate of parameters.  This is different from the likelihood of the data given the hyperparameters, which would be harder to compute.  I’d describe the current approach as more frequentist, whereas the harder approach would be more Bayesian.
      * The current API takes Documents as token count vectors.  I believe there should be an extended API taking RDD[String] or RDD[Array[String]] in a future PR.  I have sketched this out in the design doc (as well as handier versions of getTopics returning Strings).
      * Hyperparameters should be set differently for different inference/learning algorithms.  See Asuncion et al. (2009) in the design doc for a good demonstration.  I encourage good behavior via defaults and warning messages.
      
      Items planned for future PRs:
      * perplexity
      * API taking Strings
      
      * Should LDA be called LatentDirichletAllocation (and LDAModel be LatentDirichletAllocationModel)?
        * Pro: We may someday want LinearDiscriminantAnalysis.
        * Con: Very long names
      
      * Should LDA reside in clustering?  Or do we want a sub-package?
        * mllib.topicmodel
        * mllib.clustering.topicmodel
      
      * Does the API seem reasonable and extensible?
      
      * Unit tests:
        * Should there be a test which checks the clustering results?  E.g., train on a small, fake dataset with 2 very distinct topics/clusters, and ensure LDA finds those 2 topics/clusters.  Does that sound useful or too flaky?
      
      This has not been tested much for scaling.  I have run it on a laptop for 200 iterations on a 5MB dataset with 1000 terms and 5 topics.  Running it for 500 iterations made it fail because of GC problems.  I'm running larger scale tests & will put results here, but future PRs may need to improve the scaling.
      
      * dlwh  for the initial implementation
        * + jegonzal  for some code in the initial implementation
      * The many contributors towards topic model implementations in Spark which were referenced as a basis for this PR: akopich witgo yinxusen dlwh EntilZha jegonzal  IlyaKozlov
        * Note: The plan is to include this full list in the authors if this PR gets merged.  Please notify me if you prefer otherwise.
      
      CC: mengxr
      
      Authors:
        Joseph K. Bradley <joseph@databricks.com>
        Joseph Gonzalez <joseph.e.gonzalez@gmail.com>
        David Hall <david.lw.hall@gmail.com>
        Guoqiang Li <witgo@qq.com>
        Xiangrui Meng <meng@databricks.com>
        Pedro Rodriguez <pedro@snowgeek.org>
        Avanesov Valeriy <acopich@gmail.com>
        Xusen Yin <yinxusen@gmail.com>
      
      Closes #2388
      Closes #4047 from jkbradley/davidhall-lda and squashes the following commits:
      
      77e8814 [Joseph K. Bradley] small doc fix
      5c74345 [Joseph K. Bradley] cleaned up doc based on code review
      589728b [Joseph K. Bradley] Updates per code review.  Main change was in LDAExample for faster vocab computation.  Also updated PeriodicGraphCheckpointerSuite.scala to clean up checkpoint files at end
      e3980d2 [Joseph K. Bradley] cleaned up PeriodicGraphCheckpointerSuite.scala
      74487e5 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into davidhall-lda
      4ae2a7d [Joseph K. Bradley] removed duplicate graphx dependency in mllib/pom.xml
      e391474 [Joseph K. Bradley] Removed LDATiming.  Added PeriodicGraphCheckpointerSuite.scala.  Small LDA cleanups.
      e8d8acf [Joseph K. Bradley] Added catch for BreakIterator exception.  Improved preprocessing to reduce passes over data
      1a231b4 [Joseph K. Bradley] fixed scalastyle
      91aadfe [Joseph K. Bradley] Added Java-friendly run method to LDA. Added Java test suite for LDA. Changed LDAModel.describeTopics to return Java-friendly type
      b75472d [Joseph K. Bradley] merged improvements from LDATiming into LDAExample.  Will remove LDATiming after done testing
      993ca56 [Joseph K. Bradley] * Removed Document type in favor of (Long, Vector) * Changed doc ID restriction to be: id must be nonnegative and unique in the doc (instead of 0,1,2,...) * Add checks for valid ranges of eta, alpha * Rename “LearningState” to “EMOptimizer” * Renamed params: termSmoothing -> topicConcentration, topicSmoothing -> docConcentration   * Also added aliases alpha, beta
      cb5a319 [Joseph K. Bradley] Added checkpointing to LDA * new class PeriodicGraphCheckpointer * params checkpointDir, checkpointInterval to LDA
      43c1c40 [Joseph K. Bradley] small cleanup
      0b90393 [Joseph K. Bradley] renamed LDA LearningState.collectTopicTotals to globalTopicTotals
      77a2c85 [Joseph K. Bradley] Moved auto term,topic smoothing computation to get*Smoothing methods.  Changed word to term in some places.  Updated LDAExample to use default smoothing amounts.
      fb1e7b5 [Xiangrui Meng] minor
      08d59a3 [Xiangrui Meng] reset spacing
      9fe0b95 [Xiangrui Meng] optimize aggregateMessages
      cec0a9c [Xiangrui Meng] * -> *=
      6cb11b0 [Xiangrui Meng] optimize computePTopic
      9eb3d02 [Xiangrui Meng] + -> +=
      892530c [Xiangrui Meng] use axpy
      45cc7f2 [Xiangrui Meng] mapPart -> flatMap
      ce53be9 [Joseph K. Bradley] fixed example name
      75749e7 [Joseph K. Bradley] scala style fix
      9f2a492 [Joseph K. Bradley] Unit tests and fixes for LDA, now ready for PR
      377ebd9 [Joseph K. Bradley] separated LDA models into own file.  more cleanups before PR
      2d40006 [Joseph K. Bradley] cleanups before PR
      2891e89 [Joseph K. Bradley] Prepped LDA main class for PR, but some cleanups remain
      0cb7187 [Joseph K. Bradley] Added 3 files from dlwh LDA implementation
      980764f3
    • Xiangrui Meng's avatar
      [SPARK-5536] replace old ALS implementation by the new one · 0cc7b88c
      Xiangrui Meng authored
      The only issue is that `analyzeBlock` is removed, which was marked as a developer API. I didn't change other tests in the ALSSuite under `spark.mllib` to ensure that the implementation is correct.
      
      CC: srowen coderxiang
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #4321 from mengxr/SPARK-5536 and squashes the following commits:
      
      5a3cee8 [Xiangrui Meng] update python tests that are too strict
      e840acf [Xiangrui Meng] ignore scala style check for ALS.train
      e9a721c [Xiangrui Meng] update mima excludes
      9ee6a36 [Xiangrui Meng] merge master
      9a8aeac [Xiangrui Meng] update tests
      d8c3271 [Xiangrui Meng] remove analyzeBlocks
      d68eee7 [Xiangrui Meng] add checkpoint to new ALS
      22a56f8 [Xiangrui Meng] wrap old ALS
      c387dff [Xiangrui Meng] support random seed
      3bdf24b [Xiangrui Meng] make storage level configurable in the new ALS
      0cc7b88c
    • Josh Rosen's avatar
      [SPARK-5414] Add SparkFirehoseListener class for consuming all SparkListener events · b8ebebea
      Josh Rosen authored
      There isn't a good way to write a SparkListener that receives all SparkListener events and which will be future-compatible (e.g. it will receive events introduced in newer versions of Spark without having to override new methods to process those events).
      
      To address this, this patch adds `SparkFirehoseListener`, a SparkListener implementation that receives all events and dispatches them to a single `onEvent` method (which can be overridden by users).
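      
      A minimal sketch of such a listener, assuming `sc` is an existing SparkContext:
      
      ```scala
      import org.apache.spark.SparkFirehoseListener
      import org.apache.spark.scheduler.SparkListenerEvent
      
      // Counts every listener event, whatever its concrete type; event types added in
      // later Spark versions still arrive through the same onEvent entry point.
      class EventCountingListener extends SparkFirehoseListener {
        @volatile var eventCount = 0L
        override def onEvent(event: SparkListenerEvent): Unit = {
          eventCount += 1
        }
      }
      
      sc.addSparkListener(new EventCountingListener())
      ```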
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #4210 from JoshRosen/firehose-listener and squashes the following commits:
      
      223f579 [Josh Rosen] Expand comment to explain rationale for this being a Java class.
      ecdfaed [Josh Rosen] Add SparkFirehoseListener class for consuming all SparkListener events.
      b8ebebea
    • Yin Huai's avatar
      [SPARK-5501][SPARK-5420][SQL] Write support for the data source API · 13531dd9
      Yin Huai authored
      This PR aims to support `INSERT INTO/OVERWRITE TABLE tableName` and `CREATE TABLE tableName AS SELECT` for the data source API (partitioned tables are not supported).
      
      In this PR, I am also adding support for `IF NOT EXISTS` to our DDL parser. The current semantics of `IF NOT EXISTS` are as follows.
      * For a `CREATE TEMPORARY TABLE` statement, `IF NOT EXISTS` is not supported for now.
      * For a `CREATE TABLE` statement (we are creating a metastore table), if there is an existing table having the same name ...
        * when `IF NOT EXISTS` clause is used, we will do nothing.
        * when `IF NOT EXISTS` clause is not used, the user will see an exception saying the table already exists.
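      
      A hedged sketch of the statements described above, issued through `sqlContext.sql` (table names, provider, and path are placeholders):
      
      ```scala
      // CTAS through the data source API, with IF NOT EXISTS.
      sqlContext.sql("""
        CREATE TABLE IF NOT EXISTS users_backup
        USING parquet
        OPTIONS (path '/tmp/users_backup')
        AS SELECT * FROM users
      """)
      
      // Write support for INSERT INTO / OVERWRITE TABLE.
      sqlContext.sql("INSERT OVERWRITE TABLE users_backup SELECT * FROM users")
      ```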
      
      TODOs:
      - [x] CTAS support
      - [x] Programmatic APIs
      - [ ] Python API (another PR)
      - [x] More unit tests
      - [ ] Documents (another PR)
      
      marmbrus liancheng rxin
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #4294 from yhuai/writeSupport and squashes the following commits:
      
      3db1539 [Yin Huai] save does not take overwrite.
      1c98881 [Yin Huai] Fix test.
      142372a [Yin Huai] Merge remote-tracking branch 'upstream/master' into writeSupport
      34e1bfb [Yin Huai] Address comments.
      1682ca6 [Yin Huai] Better support for CTAS statements.
      e789d64 [Yin Huai] For the Scala API, let users to use tuples to provide options.
      0128065 [Yin Huai] Short hand versions of save and load.
      66ebd74 [Yin Huai] Formatting.
      9203ec2 [Yin Huai] Merge remote-tracking branch 'upstream/master' into writeSupport
      e5d29f2 [Yin Huai] Programmatic APIs.
      1a719a5 [Yin Huai] CREATE TEMPORARY TABLE with IF NOT EXISTS is not allowed for now.
      909924f [Yin Huai] Add saveAsTable for the data source API to DataFrame.
      95a7c71 [Yin Huai] Fix bug when handling IF NOT EXISTS clause in a CREATE TEMPORARY TABLE statement.
      d37b19c [Yin Huai] Cheng's comments.
      fd6758c [Yin Huai] Use BeforeAndAfterAll.
      7880891 [Yin Huai] Support CREATE TABLE AS SELECT STATEMENT and the IF NOT EXISTS clause.
      cb85b05 [Yin Huai] Initial write support.
      2f91354 [Yin Huai] Make INSERT OVERWRITE/INTO statements consistent between HiveQL and SqlParser.
      13531dd9
    • FlytxtRnD's avatar
      [SPARK-5012][MLLib][PySpark]Python API for Gaussian Mixture Model · 50a1a874
      FlytxtRnD authored
      Python API for the Gaussian Mixture Model clustering algorithm in MLLib.
      
      Author: FlytxtRnD <meethu.mathew@flytxt.com>
      
      Closes #4059 from FlytxtRnD/PythonGmmWrapper and squashes the following commits:
      
      c973ab3 [FlytxtRnD] Merge branch 'PythonGmmWrapper', remote-tracking branch 'upstream/master' into PythonGmmWrapper
      339b09c [FlytxtRnD] Added MultivariateGaussian namedtuple  and Arraybuffer in trainGaussianMixture
      fa0a142 [FlytxtRnD] New line added
      d5b36ab [FlytxtRnD] Changed argument names to lowercase
      ac134f1 [FlytxtRnD] Merge branch 'PythonGmmWrapper' of https://github.com/FlytxtRnD/spark into PythonGmmWrapper
      6671ea1 [FlytxtRnD] Added mllib/stat/distribution.py
      3aee84b [FlytxtRnD] Fixed style issues
      2e9f12a [FlytxtRnD] Added mllib/stat/distribution.py and fixed style issues
      b22532c [FlytxtRnD] Merge branch 'PythonGmmWrapper', remote-tracking branch 'upstream/master' into PythonGmmWrapper
      2e14d82 [FlytxtRnD] Incorporate MultivariateGaussian instances in GaussianMixtureModel
      05767c7 [FlytxtRnD] Merge branch 'PythonGmmWrapper', remote-tracking branch 'upstream/master' into PythonGmmWrapper
      3464d19 [FlytxtRnD] Merge branch 'PythonGmmWrapper', remote-tracking branch 'upstream/master' into PythonGmmWrapper
      c1d4c71 [FlytxtRnD] Merge branch 'PythonGmmWrapper', remote-tracking branch 'origin/PythonGmmWrapper' into PythonGmmWrapper
      426d130 [FlytxtRnD] Added random seed parameter
      332bad1 [FlytxtRnD] Merge branch 'PythonGmmWrapper', remote-tracking branch 'upstream/master' into PythonGmmWrapper
      f82750b [FlytxtRnD] Fixed style issues
      5c83825 [FlytxtRnD] Split input file with space delimiter
      fda60f3 [FlytxtRnD] Python API for Gaussian Mixture Model
      50a1a874
    • Thomas Graves's avatar
      [SPARK-3778] newAPIHadoopRDD doesn't properly pass credentials for secure hdfs · c31c36c4
      Thomas Graves authored
      This was https://github.com/apache/spark/pull/2676
      
      https://issues.apache.org/jira/browse/SPARK-3778
      
      This affects anyone trying to access secure HDFS with something like:
      val lines = {
        val hconf = new Configuration()
        hconf.set("mapred.input.dir", "mydir")
        hconf.set("textinputformat.record.delimiter", "\003432\n")
        sc.newAPIHadoopRDD(hconf, classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
      }
      
      Author: Thomas Graves <tgraves@apache.org>
      
      Closes #4292 from tgravescs/SPARK-3788 and squashes the following commits:
      
      cf3b453 [Thomas Graves] newAPIHadoopRDD doesn't properly pass credentials for secure hdfs on yarn
      c31c36c4
    • freeman's avatar
      [SPARK-4979][MLLIB] Streaming logistic regression · eb0da6c4
      freeman authored
      This adds support for streaming logistic regression with stochastic gradient descent, in the same manner as the existing implementation of streaming linear regression. It is a relatively simple addition because most of the work is already done by the abstract class `StreamingLinearAlgorithm` and existing algorithms and models from MLlib.
      
      The PR includes
      - Streaming Logistic Regression algorithm
      - Unit tests for accuracy, streaming convergence, and streaming prediction
      - An example use
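      
      A minimal usage sketch, mirroring the streaming linear regression setup (`ssc`, `trainingDir`, and `numFeatures` are assumed to exist):
      
      ```scala
      import org.apache.spark.mllib.classification.StreamingLogisticRegressionWithSGD
      import org.apache.spark.mllib.linalg.Vectors
      import org.apache.spark.mllib.regression.LabeledPoint
      
      // Train continuously on labeled points arriving as text files in trainingDir.
      val model = new StreamingLogisticRegressionWithSGD()
        .setInitialWeights(Vectors.zeros(numFeatures))
      
      val trainingStream = ssc.textFileStream(trainingDir).map(LabeledPoint.parse)
      model.trainOn(trainingStream)
      
      ssc.start()
      ssc.awaitTermination()
      ```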
      
      cc mengxr tdas
      
      Author: freeman <the.freeman.lab@gmail.com>
      
      Closes #4306 from freeman-lab/streaming-logisitic-regression and squashes the following commits:
      
      5c2c70b [freeman] Use Option on model
      5cca2bc [freeman] Merge remote-tracking branch 'upstream/master' into streaming-logisitic-regression
      275f8bd [freeman] Make private to mllib
      3926e4e [freeman] Line formatting
      5ee8694 [freeman] Experimental tag for docs
      2fc68ac [freeman] Fix example formatting
      85320b1 [freeman] Fixed line length
      d88f717 [freeman] Remove stray comment
      59d7ecb [freeman] Add streaming logistic regression
      e78fe28 [freeman] Add streaming logistic regression example
      321cc66 [freeman] Set private and protected within mllib
      eb0da6c4
  2. Feb 02, 2015
    • zsxwing's avatar
      [SPARK-5219][Core] Add locks to avoid scheduling race conditions · c306555f
      zsxwing authored
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #4019 from zsxwing/SPARK-5219 and squashes the following commits:
      
      36a8b4e [zsxwing] Add locks to avoid race conditions
      c306555f
    • Cheng Lian's avatar
      [Doc] Minor: Fixes several formatting issues · 60f67e7a
      Cheng Lian authored
      Fixes several minor formatting issues in the [Continuous Compilation] [1] section.
      
      [1]: http://spark.apache.org/docs/latest/building-spark.html#continuous-compilation
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #4316 from liancheng/fix-build-instruction-docs and squashes the following commits:
      
      0a92e01 [Cheng Lian] Fixes several formatting issues
      60f67e7a
    • Patrick Wendell's avatar
      SPARK-3996: Add jetty servlet and continuations. · 7930d2be
      Patrick Wendell authored
      These are needed transitively from the other Jetty libraries
      we include. It was not picked up by unit tests because we
      disable the UI.
      
      Author: Patrick Wendell <patrick@databricks.com>
      
      Closes #4323 from pwendell/jetty and squashes the following commits:
      
      d8669da [Patrick Wendell] SPARK-3996: Add jetty servlet and continuations.
      7930d2be
    • Patrick Wendell's avatar
      SPARK-5542: Decouple publishing, packaging, and tagging in release script · 0ef38f5f
      Patrick Wendell authored
      These are some changes to the build script to allow parts of it to be run independently. This has already been tested during the 1.2.1 release cycle.
      
      Author: Patrick Wendell <patrick@databricks.com>
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #4319 from pwendell/release-updates and squashes the following commits:
      
      dfe7ed9 [Patrick Wendell] Merge remote-tracking branch 'apache/master' into release-updates
      478b072 [Patrick Wendell] Merge remote-tracking branch 'apache/master' into release-updates
      126dd0c [Patrick Wendell] Allow decoupling Maven publishing from cutting release
      0ef38f5f
    • nemccarthy's avatar
      [SPARK-5543][WebUI] Remove unused import JsonUtil from JsonProtocol · cb39f120
      nemccarthy authored
      Simple PR to remove the unused import JsonUtil from org.apache.spark.util.JsonProtocol.scala, which fails builds with older versions of hadoop-core.
      This import is unused. It was introduced in PR #4029 (https://github.com/apache/spark/pull/4029) as part of JIRA SPARK-5231
      
      Author: nemccarthy <nathan@nemccarthy.me>
      
      Closes #4320 from nemccarthy/master and squashes the following commits:
      
      8e34a11 [nemccarthy] [SPARK-5543][WebUI] Remove unused import JsonUtil from from org.apache.spark.util.JsonProtocol.scala which fails builds with older versions of hadoop-core
      cb39f120
    • Tor Myklebust's avatar
      [SPARK-5472][SQL] A JDBC data source for Spark SQL. · 8f471a66
      Tor Myklebust authored
      This pull request contains a Spark SQL data source that can pull data from, and can put data into, a JDBC database.
      
      I have tested both read and write support with H2, MySQL, and Postgres.  It would surprise me if both read and write support worked flawlessly out-of-the-box for any other database; different databases have different names for different JDBC data types and different meanings for SQL types with the same name.  However, this code is designed (see `DriverQuirks.scala`) to make it *relatively* painless to add support for another database by augmenting the type mapping contained in this PR.
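      
      A hedged sketch of using the source through the data source DDL (the URL and table names are placeholders):
      
      ```scala
      // Register a JDBC-backed table, then query it with Spark SQL.
      sqlContext.sql("""
        CREATE TEMPORARY TABLE people
        USING org.apache.spark.sql.jdbc
        OPTIONS (url 'jdbc:postgresql://localhost/test', dbtable 'people')
      """)
      
      sqlContext.sql("SELECT name FROM people WHERE age > 21").show()
      ```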
      
      Author: Tor Myklebust <tmyklebu@gmail.com>
      
      Closes #4261 from tmyklebu/master and squashes the following commits:
      
      cf167ce [Tor Myklebust] Work around other Java tests ruining TestSQLContext.
      67893bf [Tor Myklebust] Move the jdbcRDD methods into SQLContext itself.
      585f95b [Tor Myklebust] Dependencies go into the project's pom.xml.
      829d5ba [Tor Myklebust] Merge branch 'master' of https://github.com/apache/spark
      41647ef [Tor Myklebust] Hide a couple things that don't need to be public.
      7318aea [Tor Myklebust] Fix scalastyle warnings.
      a09eeac [Tor Myklebust] JDBC data source for Spark SQL.
      176bb98 [Tor Myklebust] Add test deps for JDBC support.
      8f471a66
    • Liang-Chi Hsieh's avatar
      [SPARK-5512][Mllib] Run the PIC algorithm with initial vector suggested by the PIC paper · 1bcd4657
      Liang-Chi Hsieh authored
      As suggested by the Power Iteration Clustering paper, it is useful to set the initial vector v0 to the degree vector d. This PR adds an option to run the algorithm with that initialization.
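      
      A hedged sketch of selecting the degree-vector initialization (`similarities` is an assumed RDD[(Long, Long, Double)] of pairwise similarities):
      
      ```scala
      import org.apache.spark.mllib.clustering.PowerIterationClustering
      
      // "degree" requests v0 = d, the initialization suggested by the PIC paper.
      val pic = new PowerIterationClustering()
        .setK(3)
        .setMaxIterations(20)
        .setInitializationMode("degree")
      
      val model = pic.run(similarities)
      model.assignments.take(5).foreach(a => println(s"${a.id} -> ${a.cluster}"))
      ```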
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #4301 from viirya/pic_degreevector and squashes the following commits:
      
      7db28fb [Liang-Chi Hsieh] Refactor it to address comments.
      19cf94e [Liang-Chi Hsieh] Add an option to select initialization method.
      ec88567 [Liang-Chi Hsieh] Run the PIC algorithm with degree vector d as suggected by the PIC paper.
      1bcd4657
    • Davies Liu's avatar
      [SPARK-5154] [PySpark] [Streaming] Kafka streaming support in Python · 0561c454
      Davies Liu authored
      This PR brings the Python API for Spark Streaming Kafka data source.
      
      ```
          class KafkaUtils(__builtin__.object)
           |  Static methods defined here:
           |
            |  createStream(ssc, zkQuorum, groupId, topics, storageLevel=StorageLevel(True, True, False, False, 2), keyDecoder=<function utf8_decoder>, valueDecoder=<function utf8_decoder>)
           |      Create an input stream that pulls messages from a Kafka Broker.
           |
           |      :param ssc:  StreamingContext object
           |      :param zkQuorum:  Zookeeper quorum (hostname:port,hostname:port,..).
           |      :param groupId:  The group id for this consumer.
           |      :param topics:  Dict of (topic_name -> numPartitions) to consume.
           |                      Each partition is consumed in its own thread.
           |      :param storageLevel:  RDD storage level.
           |      :param keyDecoder:  A function used to decode key
           |      :param valueDecoder:  A function used to decode value
           |      :return: A DStream object
      ```
      run the example:
      
      ```
      bin/spark-submit --driver-class-path external/kafka-assembly/target/scala-*/spark-streaming-kafka-assembly-*.jar examples/src/main/python/streaming/kafka_wordcount.py localhost:2181 test
      ```
      
      Author: Davies Liu <davies@databricks.com>
      Author: Tathagata Das <tdas@databricks.com>
      
      Closes #3715 from davies/kafka and squashes the following commits:
      
      d93bfe0 [Davies Liu] Update make-distribution.sh
      4280d04 [Davies Liu] address comments
      e6d0427 [Davies Liu] Merge branch 'master' of github.com:apache/spark into kafka
      f257071 [Davies Liu] add tests for null in RDD
      23b039a [Davies Liu] address comments
      9af51c4 [Davies Liu] Merge branch 'kafka' of github.com:davies/spark into kafka
      a74da87 [Davies Liu] address comments
      dc1eed0 [Davies Liu] Update kafka_wordcount.py
      31e2317 [Davies Liu] Update kafka_wordcount.py
      370ba61 [Davies Liu] Update kafka.py
      97386b3 [Davies Liu] address comment
      2c567a5 [Davies Liu] update logging and comment
      33730d1 [Davies Liu] Merge branch 'master' of github.com:apache/spark into kafka
      adeeb38 [Davies Liu] Merge pull request #3 from tdas/kafka-python-api
      aea8953 [Tathagata Das] Kafka-assembly for Python API
      eea16a7 [Davies Liu] refactor
      f6ce899 [Davies Liu] add example and fix bugs
      98c8d17 [Davies Liu] fix python style
      5697a01 [Davies Liu] bypass decoder in scala
      048dbe6 [Davies Liu] fix python style
      75d485e [Davies Liu] add mqtt
      07923c4 [Davies Liu] support kafka in Python
      0561c454
    • Reynold Xin's avatar
      [SQL] Improve DataFrame API error reporting · 554403fd
      Reynold Xin authored
      1. Throw UnsupportedOperationException if a Column is not computable.
      2. Perform eager analysis on DataFrame so we can catch errors when they happen (not when an action is run).
      
      Author: Reynold Xin <rxin@databricks.com>
      Author: Davies Liu <davies@databricks.com>
      
      Closes #4296 from rxin/col-computability and squashes the following commits:
      
      6527b86 [Reynold Xin] Merge pull request #8 from davies/col-computability
      fd92bc7 [Reynold Xin] Merge branch 'master' into col-computability
      f79034c [Davies Liu] fix python tests
      5afe1ff [Reynold Xin] Fix scala test.
      17f6bae [Reynold Xin] Various fixes.
      b932e86 [Reynold Xin] Added eager analysis for error reporting.
      e6f00b8 [Reynold Xin] [SQL][API] ComputableColumn vs IncomputableColumn
      554403fd
    • Jacek Lewandowski's avatar
      Spark 3883: SSL support for HttpServer and Akka · cfea3003
      Jacek Lewandowski authored
      SPARK-3883: SSL support for Akka connections and Jetty based file servers.
      
      This story introduced the following changes:
      - Introduced SSLOptions object which holds the SSL configuration and can build the appropriate configuration for Akka or Jetty. SSLOptions can be created by parsing SparkConf entries at a specified namespace.
      - SSLOptions is created and kept by SecurityManager
      - All Akka actor address creation snippets based on interpolated strings were replaced by dedicated methods from AkkaUtils. Those methods select the proper Akka protocol - whether akka.tcp or akka.ssl.tcp
      - Added test cases for AkkaUtils, FileServer, SSLOptions and SecurityManager
      - Added a way to use node local SSL configuration by executors and driver in standalone mode. It can be done by specifying spark.ssl.useNodeLocalConf in SparkConf.
      - Made CoarseGrainedExecutorBackend not overwrite settings that are part of the executor startup configuration; they are passed from the Worker anyway
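      
      A hedged sketch of the resulting configuration surface (key names other than spark.ssl.useNodeLocalConf are assumptions based on the spark.ssl.* namespace described above):
      
      ```scala
      import org.apache.spark.SparkConf
      
      // SSL options live under the spark.ssl.* namespace and are parsed by SSLOptions.
      val conf = new SparkConf()
        .set("spark.ssl.enabled", "true")
        .set("spark.ssl.keyStore", "/path/to/keystore.jks")
        .set("spark.ssl.keyStorePassword", "secret")
        .set("spark.ssl.useNodeLocalConf", "true")
      ```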
      
      Refer to https://github.com/apache/spark/pull/3571 for discussion and details
      
      Author: Jacek Lewandowski <lewandowski.jacek@gmail.com>
      Author: Jacek Lewandowski <jacek.lewandowski@datastax.com>
      
      Closes #3571 from jacek-lewandowski/SPARK-3883-master and squashes the following commits:
      
      9ef4ed1 [Jacek Lewandowski] Merge pull request #2 from jacek-lewandowski/SPARK-3883-docs2
      fb31b49 [Jacek Lewandowski] SPARK-3883: Added SSL setup documentation
      2532668 [Jacek Lewandowski] SPARK-3883: Refactored AkkaUtils.protocol method to not use Try
      90a8762 [Jacek Lewandowski] SPARK-3883: Refactored methods to resolve Akka address and made it possible to easily configure multiple communication layers for SSL
      72b2541 [Jacek Lewandowski] SPARK-3883: A reference to the fallback SSLOptions can be provided when constructing SSLOptions
      93050f4 [Jacek Lewandowski] SPARK-3883: SSL support for HttpServer and Akka
      cfea3003
    • Xiangrui Meng's avatar
      [SPARK-5540] hide ALS.solveLeastSquares · ef65cf09
      Xiangrui Meng authored
      This method survived the code review and it has been there since v1.1.0. It exposes jblas types. Let's remove it from the public API. I think no one calls it directly.
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #4318 from mengxr/SPARK-5540 and squashes the following commits:
      
      586ade6 [Xiangrui Meng] hide ALS.solveLeastSquares
      ef65cf09
    • Joseph K. Bradley's avatar
      [SPARK-5534] [graphx] Graph getStorageLevel fix · f133dece
      Joseph K. Bradley authored
      This fixes getStorageLevel for EdgeRDDImpl and VertexRDDImpl (and therefore for Graph).
      
      See code example on JIRA which failed before but works with this patch: [https://issues.apache.org/jira/browse/SPARK-5534]
      (The added unit tests also failed before but work with this fix.)
      
      Note: I used partitionsRDD, assuming that getStorageLevel will only be called on the driver.
      
      CC: mengxr  (related to LDA PR), rxin  ankurdave   Thanks in advance!
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #4317 from jkbradley/graphx-storagelevel and squashes the following commits:
      
      1c21e49 [Joseph K. Bradley] made graph getStorageLevel test more robust
      18d64ca [Joseph K. Bradley] Added tests for getStorageLevel in VertexRDDSuite, EdgeRDDSuite, GraphSuite
      17b488b [Joseph K. Bradley] overrode getStorageLevel in Vertex/EdgeRDDImpl to use partitionsRDD
      f133dece
    • Reynold Xin's avatar
      [SPARK-5514] DataFrame.collect should call executeCollect · 8aa3cfff
      Reynold Xin authored
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #4313 from rxin/SPARK-5514 and squashes the following commits:
      
      e34e91b [Reynold Xin] [SPARK-5514] DataFrame.collect should call executeCollect
      8aa3cfff
    • seayi's avatar
      [SPARK-5195][sql]Update HiveMetastoreCatalog.scala(override the... · dca6faa2
      seayi authored
      [SPARK-5195][sql] Update HiveMetastoreCatalog.scala (override the MetastoreRelation's sameResult method to compare only the database name and table name)
      
      Override the MetastoreRelation's sameResult method so that it compares only the database name and table name.
      
      Previously, after
      cache table t1;
      select count(*) from t1;
      the query would read data from memory, but the query below would not; it would read from HDFS instead:
      select count(*) from t1 t;
      
      Cached data is keyed by the logical plan and matched with sameResult, so when a table is referenced with an alias its logical plan differs from the plan without the alias. Modifying sameResult to compare only the database name and table name fixes this.
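      
      Spelled out as a short sketch of the scenario above (using the Scala SQL interface and the same table name as in the example):
      
      ```scala
      sqlContext.sql("CACHE TABLE t1")
      sqlContext.sql("SELECT count(*) FROM t1").collect()   // answered from the in-memory cache
      sqlContext.sql("SELECT count(*) FROM t1 t").collect() // before this fix: read from HDFS again
      ```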
      
      Author: seayi <405078363@qq.com>
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #3898 from seayi/branch-1.2 and squashes the following commits:
      
      8f0c7d2 [seayi] Update CachedTableSuite.scala
      a277120 [seayi] Update HiveMetastoreCatalog.scala
      8d910aa [seayi] Update HiveMetastoreCatalog.scala
      dca6faa2
    • DB Tsai's avatar
      [SPARK-2309][MLlib] Multinomial Logistic Regression · b1aa8fe9
      DB Tsai authored
      #1379 was automatically closed by asfgit, and GitHub cannot reopen it once it's closed, so this will be the new PR.
      
      Binary Logistic Regression can be extended to Multinomial Logistic Regression by running K-1 independent Binary Logistic Regression models. The following formula is implemented.
      http://www.slideshare.net/dbtsai/2014-0620-mlor-36132297/25
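      
      A hedged usage sketch (`training` is an assumed RDD[LabeledPoint] with labels in 0..9; setNumClasses is how MLlib exposes multinomial support, not necessarily the exact surface introduced in this PR):
      
      ```scala
      import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
      
      // K = 10 classes trained as one multinomial model.
      val model = new LogisticRegressionWithLBFGS()
        .setNumClasses(10)
        .run(training)
      ```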
      
      Author: DB Tsai <dbtsai@alpinenow.com>
      
      Closes #3833 from dbtsai/mlor and squashes the following commits:
      
      4e2f354 [DB Tsai] triger jenkins
      697b7c9 [DB Tsai] address some feedback
      4ce4d33 [DB Tsai] refactoring
      ff843b3 [DB Tsai] rebase
      f114135 [DB Tsai] refactoring
      4348426 [DB Tsai] Addressed feedback from Sean Owen
      a252197 [DB Tsai] first commit
      b1aa8fe9
    • Xiangrui Meng's avatar
      [SPARK-5513][MLLIB] Add nonnegative option to ml's ALS · 46d50f15
      Xiangrui Meng authored
      This PR ports the NNLS solver to the new ALS implementation.
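      
      A hedged sketch of the new option on ml's ALS (the other setters are shown only for context):
      
      ```scala
      import org.apache.spark.ml.recommendation.ALS
      
      // Nonnegative factors are obtained by switching the least-squares solver to NNLS.
      val als = new ALS()
        .setRank(10)
        .setMaxIter(10)
        .setNonnegative(true)
      ```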
      
      CC: coderxiang
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #4302 from mengxr/SPARK-5513 and squashes the following commits:
      
      4cbdab0 [Xiangrui Meng] fix serialization
      88de634 [Xiangrui Meng] add NNLS to ml's ALS
      46d50f15
    • Daoyuan Wang's avatar
      [SPARK-4508] [SQL] build native date type to conform behavior to Hive · 1646f89d
      Daoyuan Wang authored
      Store daysSinceEpoch as an Int value(4 bytes) to represent DateType, instead of using java.sql.Date(8 bytes as Long) in catalyst row. This ensures the same comparison behavior of Hive and Catalyst.
      Subsumes #3381
      I think there are already some tests in JavaSQLSuite, and for Python this will not affect Python's datetime class.
      
      Author: Daoyuan Wang <daoyuan.wang@intel.com>
      
      Closes #3732 from adrian-wang/datenative and squashes the following commits:
      
      0ed0fdc [Daoyuan Wang] fix test data
      a2fdd4e [Daoyuan Wang] getDate
      c37832b [Daoyuan Wang] row to catalyst
      f0005b1 [Daoyuan Wang] add date in sql parser and java type conversion
      024c9a6 [Daoyuan Wang] clean some import order
      d6715fc [Daoyuan Wang] refactoring Date as Primitive Int internally
      374abd5 [Daoyuan Wang] spark native date type support
      1646f89d
    • Sandy Ryza's avatar
      SPARK-5500. Document that feeding hadoopFile into a shuffle operation will cause problems · 83093497
      Sandy Ryza authored
      
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #4293 from sryza/sandy-spark-5500 and squashes the following commits:
      
      e9ce742 [Sandy Ryza] Change to warning
      cc46e52 [Sandy Ryza] Add instructions and extend to NewHadoopRDD
      6e1932a [Sandy Ryza] Throw exception on cache
      0f6c4eb [Sandy Ryza] SPARK-5500. Document that feeding hadoopFile into a shuffle operation will cause problems
      83093497
    • Joseph K. Bradley's avatar
      [SPARK-5461] [graphx] Add isCheckpointed, getCheckpointedFiles methods to Graph · 842d0003
      Joseph K. Bradley authored
      Added the 2 methods to Graph and GraphImpl.  Both make calls to the underlying vertex and edge RDDs.
      
      This is needed for another PR (for LDA): [https://github.com/apache/spark/pull/4047]
      
      Notes:
      * getCheckpointedFiles is plural and returns a Seq[String] instead of an Option[String].
      * I attempted to test that the methods returned the correct values after checkpointing.  It did not work; I guess checkpointing does not occur quickly enough?  I noticed that there are no checkpointing tests for RDDs; is it just hard to test well?
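      
      A minimal sketch of the added methods, using the getCheckpointFiles spelling from the squashed commit list (`sc`, `checkpointDir`, and `graph` are assumed to exist):
      
      ```scala
      sc.setCheckpointDir(checkpointDir)
      graph.checkpoint()
      graph.vertices.count()              // force materialization so checkpointing happens
      
      println(graph.isCheckpointed)       // Boolean
      println(graph.getCheckpointFiles)   // Seq[String] of checkpoint files
      ```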
      
      CC: rxin
      
      CC: mengxr  (since related to LDA)
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #4253 from jkbradley/graphx-checkpoint and squashes the following commits:
      
      b680148 [Joseph K. Bradley] added class tag to firstParent call in VertexRDDImpl.isCheckpointed, though not needed to compile
      250810e [Joseph K. Bradley] In EdgeRDDImple, VertexRDDImpl, added transient back to partitionsRDD, and made isCheckpointed check firstParent instead of partitionsRDD
      695b7a3 [Joseph K. Bradley] changed partitionsRDD in EdgeRDDImpl, VertexRDDImpl to be non-transient
      cc00767 [Joseph K. Bradley] added overrides for isCheckpointed, getCheckpointFile in EdgeRDDImpl, VertexRDDImpl. The corresponding Graph methods now work.
      188665f [Joseph K. Bradley] improved documentation
      235738c [Joseph K. Bradley] Added isCheckpointed and getCheckpointFiles to Graph, GraphImpl
      842d0003
    • Jacek Lewandowski's avatar
      SPARK-5425: Use synchronised methods in system properties to create SparkConf · 5a552616
      Jacek Lewandowski authored
      SPARK-5425: Fixed usages of system properties
      
      This patch fixes a few problems caused by the fact that the Scala wrapper over system properties is not thread-safe and is basically invalid because it doesn't take into account the default values which could have been set in the properties object. The problem is fixed by modifying the `Utils.getSystemProperties` method so that it uses the `stringPropertyNames` method of the `Properties` class, which is thread-safe (internally it creates a defensive copy in a synchronized method) and returns the keys of properties that were set explicitly as well as those defined only as defaults.
      The other related problem, which is also fixed here, was in the `ResetSystemProperties` mix-in: it created a copy of the system properties in the wrong way.
      
      This patch also introduces a test case for thread-safeness of SparkConf creation.
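      
      A hedged sketch of the approach described (not the exact Spark code):
      
      ```scala
      import scala.collection.JavaConverters._
      
      // stringPropertyNames takes a synchronized, defensive snapshot of the key set
      // and also includes keys that are only defined as defaults.
      def getSystemProperties: Map[String, String] = {
        val props = System.getProperties
        props.stringPropertyNames().asScala
          .map(key => (key, props.getProperty(key)))
          .toMap
      }
      ```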
      
      Refer to the discussion in https://github.com/apache/spark/pull/4220 for more details.
      
      Author: Jacek Lewandowski <lewandowski.jacek@gmail.com>
      
      Closes #4222 from jacek-lewandowski/SPARK-5425-1.3 and squashes the following commits:
      
      03da61b [Jacek Lewandowski] SPARK-5425: Modified Utils.getSystemProperties to return a map of all system properties - explicit + defaults
      8faf2ea [Jacek Lewandowski] SPARK-5425: Use SerializationUtils to save properties in ResetSystemProperties trait
      71aa572 [Jacek Lewandowski] SPARK-5425: Use synchronised methods in system properties to create SparkConf
      5a552616
    • Martin Weindel's avatar
      Disabling Utils.chmod700 for Windows · bff65b5c
      Martin Weindel authored
      This patch makes Spark 1.2.1rc2 work again on Windows.
      
      Without it, you get the following log output when creating a Spark context:
      INFO  org.apache.spark.SparkEnv:59 - Registering BlockManagerMaster
      ERROR org.apache.spark.util.Utils:75 - Failed to create local root dir in .... Ignoring this directory.
      ERROR org.apache.spark.storage.DiskBlockManager:75 - Failed to create any local dir.
      
      Author: Martin Weindel <martin.weindel@gmail.com>
      Author: mweindel <m.weindel@usu-software.de>
      
      Closes #4299 from MartinWeindel/branch-1.2 and squashes the following commits:
      
      535cb7f [Martin Weindel] fixed last commit
      f17072e [Martin Weindel] moved condition to caller to avoid confusion on chmod700() return value
      4de5e91 [Martin Weindel] reverted to unix line ends
      fe2740b [mweindel] moved comment
      ac4749c [mweindel] fixed chmod700 for Windows
      bff65b5c
    • Marcelo Vanzin's avatar
      Make sure only owner can read / write to directories created for the job. · 52f5754f
      Marcelo Vanzin authored
      
      Whenever a directory is created by the utility method, immediately restrict
      its permissions so that only the owner has access to its contents.
      
      Signed-off-by: default avatarJosh Rosen <joshrosen@databricks.com>
      52f5754f
    • Iulian Dragos's avatar
      [SPARK-4631][streaming][FIX] Wait for a receiver to start before publishing test data. · e908322c
      Iulian Dragos authored
      This fixes two sources of non-deterministic failures in this test:
      
      - wait for a receiver to be up before pushing data through MQTT
      - gracefully handle the case where the MQTT client is overloaded. There’s
      a hard-coded limit of 10 in-flight messages, and this test may hit it.
      Instead of crashing, we retry sending the message.
      
      Both of these are needed to make the test pass reliably on my machine.
      
      Author: Iulian Dragos <jaguarul@gmail.com>
      
      Closes #4270 from dragos/issue/fix-flaky-test-SPARK-4631 and squashes the following commits:
      
      f66c482 [Iulian Dragos] [SPARK-4631][streaming] Wait for a receiver to start before publishing test data.
      d408a8e [Iulian Dragos] Install callback before connecting to MQTT broker.
      e908322c
    • Liang-Chi Hsieh's avatar
      [SPARK-5212][SQL] Add support of schema-less, custom field delimiter and SerDe for HiveQL transform · 683e9382
      Liang-Chi Hsieh authored
      This PR adds support for schema-less syntax, a custom field delimiter, and SerDe for HiveQL's transform.
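      
      A hedged sketch of a transform query exercising a custom field delimiter (the script path and table names are placeholders):
      
      ```scala
      sqlContext.sql("""
        SELECT TRANSFORM (key, value)
        ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
        USING '/bin/cat'
        AS (tKey, tValue)
        FROM src
      """)
      ```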
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #4014 from viirya/schema_less_trans and squashes the following commits:
      
      ac2d1fe [Liang-Chi Hsieh] Refactor codes for comments.
      a137933 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into schema_less_trans
      aa10fbd [Liang-Chi Hsieh] Add Hive golden answer files again.
      575f695 [Liang-Chi Hsieh] Add Hive golden answer files for new unit tests.
      a422562 [Liang-Chi Hsieh] Use createQueryTest for unit tests and remove unnecessary imports.
      ccb71e3 [Liang-Chi Hsieh] Refactor codes for comments.
      37bd391 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into schema_less_trans
      6000889 [Liang-Chi Hsieh] Wrap input and output schema into ScriptInputOutputSchema.
      21727f7 [Liang-Chi Hsieh] Move schema-less output to proper place. Use multilines instead of a long line SQL.
      9a6dc04 [Liang-Chi Hsieh] setRecordReaderID is introduced in 0.13.1, use reflection API to call it.
      7a14f31 [Liang-Chi Hsieh] Fix bug.
      799b5e1 [Liang-Chi Hsieh] Call getSerializedClass instead of using Text.
      be2c3fc [Liang-Chi Hsieh] Fix style.
      32d3046 [Liang-Chi Hsieh] Add SerDe support.
      ab22f7b [Liang-Chi Hsieh] Fix style.
      7a48e42 [Liang-Chi Hsieh] Add support of custom field delimiter.
      b1729d9 [Liang-Chi Hsieh] Fix style.
      ccee49e [Liang-Chi Hsieh] Add unit test.
      f561c37 [Liang-Chi Hsieh] Add support of schema-less script transformation.
      683e9382