  1. Nov 01, 2014
    • [SPARK-4121] Set commons-math3 version based on hadoop profiles, instead of shading · d8176b1c
      Xiangrui Meng authored
      In #2928, we shaded commons-math3 to prevent future conflicts with Hadoop. This caused problems with our Jenkins master build with Maven: some tests used local-cluster mode, where the assembly jar contains relocated math3 classes, while the MLlib test code still compiles against core and the untouched math3 classes.
      
      This PR sets commons-math3 version based on hadoop profiles.
      
      pwendell JoshRosen srowen
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #3023 from mengxr/SPARK-4121-alt and squashes the following commits:
      
      580f6d9 [Xiangrui Meng] replace tab by spaces
      7f71f08 [Xiangrui Meng] revert changes to PoissonSampler to avoid conflicts
      d3353d9 [Xiangrui Meng] do not shade commons-math3
      b4180dc [Xiangrui Meng] temp work
    • Revert "[SPARK-4183] Enable NettyBlockTransferService by default" · 7894de27
      Patrick Wendell authored
      This reverts commit 59e626c7.
    • [SPARK-4037][SQL] Removes the SessionState instance created in HiveThriftServer2 · ad0fde10
      Cheng Lian authored
      `HiveThriftServer2` creates a global singleton `SessionState` instance and overrides `HiveContext` to inject the `SessionState` object. This messes up `SessionState` initialization and causes problems.
      
      This PR replaces the global `SessionState` with `HiveContext.sessionState` to avoid the initialization conflict. Also, `HiveContext` now reuses an existing started `SessionState` if one is present (this is required by `SparkSQLCLIDriver`, which uses the specialized `CliSessionState`).
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #2887 from liancheng/spark-4037 and squashes the following commits:
      
      8446675 [Cheng Lian] Removes redundant Driver initialization
      a28fef5 [Cheng Lian] Avoid starting HiveContext.sessionState multiple times
      49b1c5b [Cheng Lian] Reuses existing started SessionState if any
      3cd6fab [Cheng Lian] Fixes SPARK-4037
    • [SPARK-3796] Create external service which can serve shuffle files · f55218ae
      Aaron Davidson authored
      This patch introduces the tooling necessary to construct an external shuffle service which is independent of Spark executors, and then uses this service inside Spark. An example of the service creation (just for the sake of this PR) can be found in Worker, and the service itself is used by plugging in the StandaloneShuffleClient as Spark's ShuffleClient (set up in BlockManager).
      
      This PR continues the work from #2753, which extracted out the transport layer of Spark's block transfer into an independent package within Spark. A new package was created which contains the Spark business logic necessary to retrieve the actual shuffle data, which is completely independent of the transport layer introduced in the previous patch. Similar to the transport layer, this package must not depend on Spark as we anticipate plugging this service as a lightweight process within, say, the YARN NodeManager, and do not wish to include Spark's dependencies (including Scala itself).
      
      There are several outstanding tasks which must be completed before this PR can be merged:
      - [x] Complete unit testing of network/shuffle package.
      - [x] Performance and correctness testing on a real cluster.
      - [x] Remove example service instantiation from Worker.scala.
      
      There are even more shortcomings of this PR which should be addressed in followup patches:
      - Don't use Java serializer for RPC layer! It is not cross-version compatible.
      - Handle shuffle file cleanup for dead executors once the application terminates or the ContextCleaner triggers.
      - Documentation of the feature in the Spark docs.
      - Improve behavior if the shuffle service itself goes down (right now we don't blacklist it, and new executors cannot spawn on that machine).
      - SSL and SASL integration
      - Nice to have: Handle shuffle file consolidation (this would require changes to Spark's implementation).
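      As a usage illustration, a minimal sketch of how an application could opt in to the external shuffle service once one is running (the configuration key comes from the finished feature's documentation; everything else here is illustrative):
      ```
      import org.apache.spark.{SparkConf, SparkContext}

      // Minimal sketch: have executors fetch shuffle blocks from an
      // already-running external service instead of serving them themselves.
      val conf = new SparkConf()
        .setAppName("external-shuffle-demo")
        .set("spark.shuffle.service.enabled", "true")
      val sc = new SparkContext(conf)
      ```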
      
      Author: Aaron Davidson <aaron@databricks.com>
      
      Closes #3001 from aarondav/shuffle-service and squashes the following commits:
      
      4d1f8c1 [Aaron Davidson] Remove changes to Worker
      705748f [Aaron Davidson] Rename Standalone* to External*
      fd3928b [Aaron Davidson] Do not unregister executor outputs unduly
      9883918 [Aaron Davidson] Make suggested build changes
      3d62679 [Aaron Davidson] Add Spark integration test
      7fe51d5 [Aaron Davidson] Fix SBT integration
      56caa50 [Aaron Davidson] Address comments
      c8d1ac3 [Aaron Davidson] Add unit tests
      2f70c0c [Aaron Davidson] Fix unit tests
      5483e96 [Aaron Davidson] Fix unit tests
      46a70bf [Aaron Davidson] Whoops, bracket
      5ea4df6 [Aaron Davidson] [SPARK-3796] Create external service which can serve shuffle files
    • [SPARK-3569][SQL] Add metadata field to StructField · 1d4f3552
      Xiangrui Meng authored
      Add `metadata: Metadata` to `StructField` to store extra information of columns. `Metadata` is a simple wrapper over `Map[String, Any]` with value types restricted to Boolean, Long, Double, String, Metadata, and arrays of those types. SerDe is via JSON.
      
      Metadata is preserved through simple operations like `SELECT`.
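      As an illustration, a minimal sketch of attaching metadata to a column. The `org.apache.spark.sql.types` import path is the one from later releases and is an assumption here; the builder and field APIs are as described above:
      ```
      import org.apache.spark.sql.types._

      // Build an immutable Metadata value and attach it to one column.
      val meta = new MetadataBuilder()
        .putString("description", "customer age in years")
        .putLong("maxValue", 150L)
        .build()

      val schema = StructType(Seq(
        StructField("name", StringType, nullable = true),
        StructField("age", IntegerType, nullable = true, metadata = meta)
      ))

      // The metadata rides along with the field through simple operations.
      println(schema("age").metadata.getString("description"))
      ```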
      
      marmbrus liancheng
      
      Author: Xiangrui Meng <meng@databricks.com>
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #2701 from mengxr/structfield-metadata and squashes the following commits:
      
      dedda56 [Xiangrui Meng] merge remote
      5ef930a [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into structfield-metadata
      c35203f [Xiangrui Meng] Merge pull request #1 from marmbrus/pr/2701
      886b85c [Michael Armbrust] Expose Metadata and MetadataBuilder through the public scala and java packages.
      589f314 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into structfield-metadata
      1e2abcf [Xiangrui Meng] change default value of metadata to None in python
      611d3c2 [Xiangrui Meng] move metadata from Expr to NamedExpr
      ddfcfad [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into structfield-metadata
      a438440 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into structfield-metadata
      4266f4d [Xiangrui Meng] add StructField.toString back for backward compatibility
      3f49aab [Xiangrui Meng] remove StructField.toString
      24a9f80 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into structfield-metadata
      473a7c5 [Xiangrui Meng] merge master
      c9d7301 [Xiangrui Meng] organize imports
      1fcbf13 [Xiangrui Meng] change metadata type in StructField for Scala/Java
      60cc131 [Xiangrui Meng] add doc and header
      60614c7 [Xiangrui Meng] add metadata
      e42c452 [Xiangrui Meng] merge master
      93518fb [Xiangrui Meng] support metadata in python
      905bb89 [Xiangrui Meng] java conversions
      618e349 [Xiangrui Meng] make tests work in scala
      61b8e0f [Xiangrui Meng] merge master
      7e5a322 [Xiangrui Meng] do not output metadata in StructField.toString
      c41a664 [Xiangrui Meng] merge master
      d8af0ed [Xiangrui Meng] move tests to SQLQuerySuite
      67fdebb [Xiangrui Meng] add test on join
      d65072e [Xiangrui Meng] remove Map.empty
      367d237 [Xiangrui Meng] add test
      c194d5e [Xiangrui Meng] add metadata field to StructField and Attribute
    • [SPARK-4183] Enable NettyBlockTransferService by default · 59e626c7
      Aaron Davidson authored
      Note that we're turning this on for at least the first part of the QA period as a trial. We want to enable this (and deprecate the NioBlockTransferService) as soon as possible in the hopes that NettyBlockTransferService will be more stable and easier to maintain. We will turn it off if we run into major issues.
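      For reference, a sketch of how a user could fall back to the old transport if issues surface during QA (the configuration key is the Spark 1.2-era setting and is stated as an assumption, not text from this commit):
      ```
      import org.apache.spark.SparkConf

      // Fall back to the NIO-based transport if the Netty one misbehaves.
      val conf = new SparkConf().set("spark.shuffle.blockTransferService", "nio")
      ```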
      
      Author: Aaron Davidson <aaron@databricks.com>
      
      Closes #3049 from aarondav/enable-netty and squashes the following commits:
      
      bb981cc [Aaron Davidson] [SPARK-4183] Enable NettyBlockTransferService by default
    • [SPARK-2759][CORE] Generic Binary File Support in Spark · 7136719b
      Kevin Mader authored
      This PR adds the abstract BinaryFileInputFormat and BinaryRecordReader classes for reading in data as a byte stream and converting it to another format using the ```def parseByteArray(inArray: Array[Byte]): T``` function.
      As a trivial example, ```ByteInputFormat``` and ```ByteRecordReader``` are included, which simply return the Array[Byte] for a given file.
      Finally, an RDD for ```BinaryFileInputFormat``` (to allow for easier partitioning changes, as was done for WholeFileInput) was added, along with the appropriate byteFiles methods on the ```SparkContext```, so the functions can easily be used by others.
      A common use case might be to read in a folder:
      ```
      sc.byteFiles("s3://mydrive/tif/*.tif").map(rawData => ReadTiffFromByteArray(rawData))
      ```
      
      Author: Kevin Mader <kevinmader@gmail.com>
      Author: Kevin Mader <kmader@users.noreply.github.com>
      
      Closes #1658 from kmader/master and squashes the following commits:
      
      3c49a30 [Kevin Mader] fixing wholetextfileinput to it has the same setMinPartitions function as in BinaryData files
      359a096 [Kevin Mader] making the final corrections suggested by @mateiz and renaming a few functions to make their usage clearer
      6379be4 [Kevin Mader] reorganizing code
      7b9d181 [Kevin Mader] removing developer API, cleaning up imports
      8ac288b [Kevin Mader] fixed a single slightly over 100 character line
      92bda0d [Kevin Mader] added new tests, renamed files, fixed several of the javaapi functions, formatted code more nicely
      a32fef7 [Kevin Mader] removed unneeded classes added DeveloperApi note to portabledatastreams since the implementation might change
      49174d9 [Kevin Mader] removed unneeded classes added DeveloperApi note to portabledatastreams since the implementation might change
      c27a8f1 [Kevin Mader] jenkins crashed before running anything last time, so making minor change
      b348ce1 [Kevin Mader] fixed order in check (prefix only appears on jenkins not when I run unit tests locally)
      0588737 [Kevin Mader] filename check in "binary file input as byte array" test now ignores prefixes and suffixes which might get added by Hadoop
      4163e38 [Kevin Mader] fixing line length and output from FSDataInputStream to DataInputStream to minimize sensitivity to Hadoop API changes
      19812a8 [Kevin Mader] Fixed the serialization issue with PortableDataStream since neither CombineFileSplit nor TaskAttemptContext implement the Serializable interface, by using ByteArrays for storing both and then recreating the objects from these bytearrays as needed.
      238c83c [Kevin Mader] fixed several scala-style issues, changed structure of binaryFiles, removed excessive classes added new tests. The caching tests still have a serialization issue, but that should be easily fixed as well.
      932a206 [Kevin Mader] Update RawFileInput.scala
      a01c9cf [Kevin Mader] Update RawFileInput.scala
      441f79a [Kevin Mader] fixed a few small comments and dependency
      12e7be1 [Kevin Mader] removing imglib from maven (definitely not ready yet)
      5deb79e [Kevin Mader] added new portabledatastream to code so that it can be serialized correctly
      f032bc0 [Kevin Mader] fixed bug in path name, renamed tests
      bc5c0b9 [Kevin Mader] made minor stylistic adjustments from mateiz
      df8e528 [Kevin Mader] fixed line lengths and changed java test
      9a313d5 [Kevin Mader] making classes that needn't be public private, adding automatic file closure, adding new tests
      edf5829 [Kevin Mader] fixing line lengths, adding new lines
      f4841dc [Kevin Mader] un-optimizing imports, silly intellij
      eacfaa6 [Kevin Mader] Added FixedLengthBinaryInputFormat and RecordReader from freeman-lab and added them to both the JavaSparkContext and the SparkContext as fixedLengthBinaryFile
      1622935 [Kevin Mader] changing the line lengths to make jenkins happy
      1cfa38a [Kevin Mader] added apache headers, added datainputstream directly as an output option for more complicated readers (HDF5 perhaps), and renamed several of the functions and files to be more consistent. Also added parallel functions to the java api
      84035f1 [Kevin Mader] adding binary and byte file support spark
      81c5f12 [Kevin Mader] Merge pull request #1 from apache/master
    • [SPARK-4115][GraphX] Add overridden count for edge counting of EdgeRDD. · ee29ef38
      luluorta authored
      Accumulate the sizes of all the EdgePartitions, just as VertexRDD does.
      
      Author: luluorta <luluorta@gmail.com>
      
      Closes #2975 from luluorta/graph-edge-count and squashes the following commits:
      
      86ef0e5 [luluorta] Add overrided count for edge counting of EdgeRDD.
    • [SPARK-4142][GraphX] Default numEdgePartitions · f4e0b28c
      Joseph E. Gonzalez authored
      Changing the default number of edge partitions to match Spark parallelism.
      
      Author: Joseph E. Gonzalez <joseph.e.gonzalez@gmail.com>
      
      Closes #3006 from jegonzal/default_partitions and squashes the following commits:
      
      a9a5c4f [Joseph E. Gonzalez] Changing the default number of edge partitions to match spark parallelism
    • Upgrading to roaring 0.4.5 (bug fix release) · 680fd87c
      Daniel Lemire authored
      I recommend upgrading roaring to 0.4.5, as it fixes a rarely occurring bug in iterators (which would otherwise throw an unwarranted exception). The upgrade should have no other consequences.
      
      Author: Daniel Lemire <lemire@gmail.com>
      
      Closes #3044 from lemire/master and squashes the following commits:
      
      54018c5 [Daniel Lemire] Recommended update to roaring 0.4.5 (bug fix release)
      048933e [Daniel Lemire] Merge remote-tracking branch 'upstream/master'
      431f3a0 [Daniel Lemire] Recommended bug fix release
    • Streaming KMeans [MLLIB][SPARK-3254] · 98c556eb
      freeman authored
      This adds a Streaming KMeans algorithm to MLlib. It uses an update rule that generalizes the mini-batch KMeans update to incorporate a decay factor, which allows past data to be forgotten. The decay factor can be specified explicitly, or via a more intuitive "fractional decay" setting, in units of either data points or batches.
      
      The PR includes:
      - StreamingKMeans algorithm with decay factor settings
      - Usage example
      - Additions to documentation clustering page
      - Unit tests of basic behavior and decay behaviors
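      A minimal usage sketch, following the `StreamingKMeans` API described above (the input path and batch interval are placeholders):
      ```
      import org.apache.spark.SparkConf
      import org.apache.spark.mllib.clustering.StreamingKMeans
      import org.apache.spark.mllib.linalg.Vectors
      import org.apache.spark.streaming.{Seconds, StreamingContext}

      val ssc = new StreamingContext(new SparkConf().setAppName("skm"), Seconds(10))
      // Each line in the training directory is a vector like "[1.0,2.0]".
      val trainingData = ssc.textFileStream("/tmp/training").map(Vectors.parse)

      val model = new StreamingKMeans()
        .setK(3)                  // number of clusters
        .setDecayFactor(0.5)      // forget past data; 1.0 keeps all history
        .setRandomCenters(2, 0.0) // dimension 2, initial weight 0.0

      model.trainOn(trainingData) // update centers on every new batch
      ssc.start()
      ssc.awaitTermination()
      ```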
      
      tdas mengxr rezazadeh
      
      Author: freeman <the.freeman.lab@gmail.com>
      Author: Jeremy Freeman <the.freeman.lab@gmail.com>
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #2942 from freeman-lab/streaming-kmeans and squashes the following commits:
      
      b2e5b4a [freeman] Fixes to docs / examples
      078617c [Jeremy Freeman] Merge pull request #1 from mengxr/SPARK-3254
      2e682c0 [Xiangrui Meng] take discount on previous weights; use BLAS; detect dying clusters
      0411bf5 [freeman] Change decay parameterization
      9f7aea9 [freeman] Style fixes
      374a706 [freeman] Formatting
      ad9bdc2 [freeman] Use labeled points and predictOnValues in examples
      77dbd3f [freeman] Make initialization check an assertion
      9cfc301 [freeman] Make random seed an argument
      44050a9 [freeman] Simpler constructor
      c7050d5 [freeman] Fix spacing
      2899623 [freeman] Use pattern matching for clarity
      a4a316b [freeman] Use collect
      1472ec5 [freeman] Doc formatting
      ea22ec8 [freeman] Fix imports
      2086bdc [freeman] Log cluster center updates
      ea9877c [freeman] More documentation
      9facbe3 [freeman] Bug fix
      5db7074 [freeman] Example usage for StreamingKMeans
      f33684b [freeman] Add explanation and example to docs
      b5b5f8d [freeman] Add better documentation
      a0fd790 [freeman] Merge remote-tracking branch 'upstream/master' into streaming-kmeans
      9fd9c15 [freeman] Merge remote-tracking branch 'upstream/master' into streaming-kmeans
      b93350f [freeman] Streaming KMeans with decay
  2. Oct 31, 2014
    • [MLLIB] SPARK-1547: Add Gradient Boosting to MLlib · 86021955
      Manish Amde authored
      Given the popular demand for gradient boosting and AdaBoost in MLlib, I am creating a WIP branch for early feedback on gradient boosting, with AdaBoost to follow soon after this PR is accepted. This is based on work done along with hirakendu that was pending due to decision tree optimizations and random forests work.
      
      Ideally, boosting algorithms should work with any base learners.  This will soon be possible once the MLlib API is finalized -- we want to ensure we use a consistent interface for the underlying base learners. In the meantime, this PR uses decision trees as base learners for the gradient boosting algorithm. The current PR allows "pluggable" loss functions and provides least squares error and least absolute error by default.
      
      Here is the task list:
      - [x] Gradient boosting support
      - [x] Pluggable loss functions
      - [x] Stochastic gradient boosting support – Re-use the BaggedPoint approach used for RandomForest.
      - [x] Binary classification support
      - [x] Support configurable checkpointing – This approach will avoid long lineage chains.
      - [x] Create classification and regression APIs
      - [x] Weighted Ensemble Model -- created a WeightedEnsembleModel class that can be used by ensemble algorithms such as random forests and boosting.
      - [x] Unit Tests
      
      Future work:
      + Multi-class classification is currently not supported by this PR since it requires discussion on the best way to support "deviance" as a loss function.
      + BaggedRDD caching -- Avoid repeating feature to bin mapping for each tree estimator after standard API work is completed.
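      As a usage illustration, a sketch based on the gradient-boosted-trees API as it later stabilized in MLlib; the class and method names are from subsequent releases, not necessarily this PR's initial API:
      ```
      import org.apache.spark.mllib.regression.LabeledPoint
      import org.apache.spark.mllib.tree.GradientBoostedTrees
      import org.apache.spark.mllib.tree.configuration.BoostingStrategy
      import org.apache.spark.rdd.RDD

      // Assumes `trainingData` is already loaded as an RDD of labeled points.
      def trainGbt(trainingData: RDD[LabeledPoint]) = {
        val strategy = BoostingStrategy.defaultParams("Regression") // squared error loss
        strategy.numIterations = 10                                 // number of trees
        GradientBoostedTrees.train(trainingData, strategy)
      }
      ```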
      
      cc: jkbradley hirakendu mengxr etrain atalwalkar chouqin
      
      Author: Manish Amde <manish9ue@gmail.com>
      Author: manishamde <manish9ue@gmail.com>
      
      Closes #2607 from manishamde/gbt and squashes the following commits:
      
      991c7b5 [Manish Amde] public api
      ff2a796 [Manish Amde] addressing comments
      b4c1318 [Manish Amde] removing spaces
      8476b6b [Manish Amde] fixing line length
      0183cb9 [Manish Amde] fixed naming and formatting issues
      1c40c33 [Manish Amde] add newline, removed spaces
      e33ab61 [Manish Amde] minor comment
      eadbf09 [Manish Amde] parameter renaming
      035a2ed [Manish Amde] jkbradley formatting suggestions
      9f7359d [Manish Amde] simplified gbt logic and added more tests
      49ba107 [Manish Amde] merged from master
      eff21fe [Manish Amde] Added gradient boosting tests
      3fd0528 [Manish Amde] moved helper methods to new class
      a32a5ab [Manish Amde] added test for subsampling without replacement
      781542a [Manish Amde] added support for fractional subsampling with replacement
      3a18cc1 [Manish Amde] cleaned up api for conversion to bagged point and moved tests to it's own test suite
      0e81906 [Manish Amde] improving caching unpersisting logic
      d971f73 [Manish Amde] moved RF code to use WeightedEnsembleModel class
      fee06d3 [Manish Amde] added weighted ensemble model
      1b01943 [Manish Amde] add weights for base learners
      9bc6e74 [Manish Amde] adding random seed as parameter
      d2c8323 [Manish Amde] Merge branch 'master' into gbt
      2ae97b7 [Manish Amde] added documentation for the loss classes
      9366b8f [Manish Amde] minor: using numTrees instead of trees.size
      3b43896 [Manish Amde] added learning rate for prediction
      9b2e35e [Manish Amde] Merge branch 'master' into gbt
      6a11c02 [manishamde] fixing formatting
      823691b [Manish Amde] fixing RF test
      1f47941 [Manish Amde] changing access modifier
      5b67102 [Manish Amde] shortened parameter list
      5ab3796 [Manish Amde] minor reformatting
      9155a9d [Manish Amde] consolidated boosting configuration and added public API
      631baea [Manish Amde] Merge branch 'master' into gbt
      2cb1258 [Manish Amde] public API support
      3b8ffc0 [Manish Amde] added documentation
      8e10c63 [Manish Amde] modified unpersist strategy
      f62bc48 [Manish Amde] added unpersist
      bdca43a [Manish Amde] added timing parameters
      2fbc9c7 [Manish Amde] fixing binomial classification prediction
      6dd4dd8 [Manish Amde] added support for log loss
      9af0231 [Manish Amde] classification attempt
      62cc000 [Manish Amde] basic checkpointing
      4784091 [Manish Amde] formatting
      78ed452 [Manish Amde] added newline and fixed if statement
      3973dd1 [Manish Amde] minor indicating subsample is double during comparison
      aa8fae7 [Manish Amde] minor refactoring
      1a8031c [Manish Amde] sampling with replacement
      f1c9ef7 [Manish Amde] Merge branch 'master' into gbt
      cdceeef [Manish Amde] added documentation
      6251fd5 [Manish Amde] modified method name
      5538521 [Manish Amde] disable checkpointing for now
      0ae1c0a [Manish Amde] basic gradient boosting code from earlier branches
    • [SPARK-3838][examples][mllib][python] Word2Vec example in python · e07fb6a4
      Anant authored
      This pull request refers to issue: https://issues.apache.org/jira/browse/SPARK-3838
      
      Python example for word2vec
      mengxr
      
      Author: Anant <anant.asty@gmail.com>
      
      Closes #2952 from anantasty/SPARK-3838 and squashes the following commits:
      
      87bd723 [Anant] remove stop line
      4bd439e [Anant] Changes as per code review. Fixed error in word2vec python example, simplified example in docs.
      3d3c9ee [Anant] Added empty line after python imports
      0c90c31 [Anant] Fixed erroneous code. I was still treating each line to be a single word instead of 16 words
      ee4f5f6 [Anant] Fixes from code review comments
      c637bcf [Anant] Added word2vec python example to docs
      269f31f [Anant] added example in docs
      c015b14 [Anant] Added python example for word2vec
    • [MLLIB] SPARK-2329 Add multi-label evaluation metrics · 62d01d25
      Alexander Ulanov authored
      Implementation of various multi-label classification measures, including: Hamming loss; strict and default accuracy; macro-averaged precision, recall and F1-measure based on documents and labels; and micro-averaged measures: https://issues.apache.org/jira/browse/SPARK-2329
      
      Multi-class measures are currently in the following pull request: https://github.com/apache/spark/pull/1155
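      A short usage sketch, assuming the class landed as `MultilabelMetrics` in `org.apache.spark.mllib.evaluation` (the name used in later MLlib releases):
      ```
      import org.apache.spark.mllib.evaluation.MultilabelMetrics
      import org.apache.spark.rdd.RDD

      // One (predicted labels, true labels) pair per document.
      def report(predictionAndLabels: RDD[(Array[Double], Array[Double])]): Unit = {
        val metrics = new MultilabelMetrics(predictionAndLabels)
        println(s"Hamming loss    = ${metrics.hammingLoss}")
        println(s"Accuracy        = ${metrics.accuracy}")
        println(s"Micro precision = ${metrics.microPrecision}")
        println(s"Micro recall    = ${metrics.microRecall}")
      }
      ```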
      
      Author: Alexander Ulanov <nashb@yandex.ru>
      Author: avulanov <nashb@yandex.ru>
      
      Closes #1270 from avulanov/multilabelmetrics and squashes the following commits:
      
      fc8175e [Alexander Ulanov] Merge with previous updates
      43a613e [Alexander Ulanov] Addressing reviewers comments: change Set to Array
      517a594 [avulanov] Addressing reviewers comments: Scala style
      cf4222bc [avulanov] Addressing reviewers comments: renaming. Added label method that returns the list of labels
      1843f73 [Alexander Ulanov] Scala style fix
      79e8476 [Alexander Ulanov] Replacing fold(_ + _) with sum as suggested by srowen
      ca46765 [Alexander Ulanov] Cosmetic changes: Apache header and parameter explanation
      40593f5 [Alexander Ulanov] Multi-label metrics: Hamming-loss, strict and normal accuracy, fix to macro measures, bunch of tests
      ad62df0 [Alexander Ulanov] Comments and scala style check
      154164b [Alexander Ulanov] Multilabel evaluation metics and tests: macro precision and recall averaged by docs, micro and per-class precision and recall averaged by class
    • SPARK-4175. Exception on stage page · 23f73f52
      Sandy Ryza authored
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #3043 from sryza/sandy-spark-4175 and squashes the following commits:
      
      e327340 [Sandy Ryza] SPARK-4175. Exception on stage page
    • [HOT FIX] Yarn stable tests don't compile · 087e31a7
      andrewor14 authored
      This is caused by this commit: acd4ac7c
      
      Author: andrewor14 <andrew@databricks.com>
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #3041 from andrewor14/yarn-hot-fix and squashes the following commits:
      
      e5deba1 [andrewor14] Add new line at the end (minor)
      aa998e8 [Andrew Or] Compilation hot fix
    • [SPARK-3870] EOL character enforcement · 55ab7770
      Kousuke Saruta authored
      We have shell scripts and Windows batch files, so we should enforce proper EOL characters.
      
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #2726 from sarutak/eol-enforcement and squashes the following commits:
      
      9748c3f [Kousuke Saruta] Fixed make.bat
      252de89 [Kousuke Saruta] Removed extra characters from make.bat
      5b81c00 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into eol-enforcement
      8633ed2 [Kousuke Saruta] merge branch 'master' of git://git.apache.org/spark into eol-enforcement
      5d630d8 [Kousuke Saruta] Merged
      ba10797 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into eol-enforcement
      7407515 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into eol-enforcement
      772fd4e [Kousuke Saruta] Normalized EOL character in make.bat and compute-classpath.cmd
      ac7f873 [Kousuke Saruta] Added an entry for .gitattributes to .rat-excludes
      1570e77 [Kousuke Saruta] Added .gitattributes
    • [SPARK-4150][PySpark] return self in rdd.setName · f1e7361f
      Xiangrui Meng authored
      Then we can do `rdd.setName('abc').cache().count()`.
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #3011 from mengxr/rdd-setname and squashes the following commits:
      
      10d0d60 [Xiangrui Meng] update test
      4ac3bbd [Xiangrui Meng] return self in rdd.setName
    • [SPARK-4141] Hide Accumulators column on stage page when no accumulators exist · a68ecf32
      Mark Mims authored
      WebUI
      
      Author: Mark Mims <mark.mims@canonical.com>
      
      This patch had conflicts when merged, resolved by
      Committer: Josh Rosen <joshrosen@databricks.com>
      
      Closes #3031 from mmm/remove-accumulators-col and squashes the following commits:
      
      6141cb3 [Mark Mims] reformat to satisfy scalastyle linelength.  build failed from jenkins https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22604/
      390893b [Mark Mims] cleanup
      c28c449 [Mark Mims] looking much better now... minimal explicit formatting.  Now, see if any sort keys make sense
      fb72156 [Mark Mims] mimic hasInput.  The basics work here, but wanna clean this up with maybeAccumulators for column content
    • [SPARK-2220][SQL] Fixes remaining Hive commands · 23468e7e
      Cheng Lian authored
      This PR adds support for the `ADD FILE` Hive command, and removes `ShellCommand` and `SourceCommand`. The reason is described in [this SPARK-2220 comment](https://issues.apache.org/jira/browse/SPARK-2220?focusedCommentId=14191841&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14191841).
      
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      
      Closes #3038 from liancheng/hive-commands and squashes the following commits:
      
      6db61e0 [Cheng Lian] Fixes remaining Hive commands
    • [SPARK-4154][SQL] Query does not work if it has "not between" in Spark SQL and HQL · ea465af1
      ravipesala authored
      A query does not work if it contains "not between", e.g.:
      SELECT * FROM src where key not between 10 and 20
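      With the fix, a query of this shape parses in both dialects; a minimal sketch (the table and column names are placeholders):
      ```
      import org.apache.spark.sql.SQLContext

      // Assumes `src` is a registered table with an integer `key` column.
      // NOT BETWEEN behaves like !(key >= 10 AND key <= 20).
      def notBetween(sqlContext: SQLContext) =
        sqlContext.sql("SELECT * FROM src WHERE key NOT BETWEEN 10 AND 20")
      ```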
      
      Author: ravipesala <ravindra.pesala@huawei.com>
      
      Closes #3017 from ravipesala/SPARK-4154 and squashes the following commits:
      
      65fc89e [ravipesala] Handled admin comments
      32e6d42 [ravipesala] 'not between' is not working
    • [SPARK-4077][SQL] Spark SQL returns wrong values for valid string timestamp values · fa712b30
      Venkata Ramana Gollamudi authored
      In org.apache.hadoop.hive.serde2.io.TimestampWritable.set, if the next entry is null, the current timestamp object is reset.
      Because of this, HiveInspectors.unwrap cannot reuse the same timestamp object without creating a copy.
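      The underlying hazard is a reused mutable holder; a hedged sketch of the defensive copy (illustrative, not the PR's exact code):
      ```
      import java.sql.Timestamp

      // TimestampWritable reuses one internal Timestamp, so unwrap must clone
      // the value before storing it in a row; otherwise a later set(...) call
      // silently clobbers it.
      def copyTimestamp(t: Timestamp): Timestamp = {
        val copied = new Timestamp(t.getTime) // millisecond part
        copied.setNanos(t.getNanos)           // restore nanosecond precision
        copied
      }
      ```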
      
      Author: Venkata Ramana G <ramana.gollamudi@huawei.com>
      
      Author: Venkata Ramana Gollamudi <ramana.gollamudi@huawei.com>
      
      Closes #3019 from gvramana/spark_4077 and squashes the following commits:
      
      32d818f [Venkata Ramana Gollamudi] fixed check style
      fa01e71 [Venkata Ramana Gollamudi] cloned timestamp object as org.apache.hadoop.hive.serde2.io.TimestampWritable.set will reset current time object
    • [SPARK-3826][SQL] Enable hive-thriftserver to support hive-0.13.1 · 7c41d135
      wangfei authored
       In #2241 hive-thriftserver is not enabled. This patch enables hive-thriftserver to support hive-0.13.1 by using a shim layer, following the approach of #2241.

       1 A light shim layer (code in sql/hive-thriftserver/hive-version) for each hive version, to handle API compatibility

       2 New pom profiles "hive-default" and "hive-versions" (copied from #2241) to activate different hive versions

       3 SBT commands for the different versions, as follows:
         hive-0.12.0 --- sbt/sbt -Phive,hadoop-2.3 -Phive-0.12.0 assembly
         hive-0.13.1 --- sbt/sbt -Phive,hadoop-2.3 -Phive-0.13.1 assembly

       4 Since hive-thriftserver depends on the hive subproject, this patch should be merged with #2241 to enable hive-0.13.1 for hive-thriftserver
      
      Author: wangfei <wangfei1@huawei.com>
      Author: scwf <wangfei1@huawei.com>
      
      Closes #2685 from scwf/shim-thriftserver1 and squashes the following commits:
      
      f26f3be [wangfei] remove clean to save time
      f5cac74 [wangfei] remove local hivecontext test
      578234d [wangfei] use new shaded hive
      18fb1ff [wangfei] exclude kryo in hive pom
      fa21d09 [wangfei] clean package assembly/assembly
      8a4daf2 [wangfei] minor fix
      0d7f6cf [wangfei] address comments
      f7c93ae [wangfei] adding build with hive 0.13 before running tests
      bcf943f [wangfei] Merge branch 'master' of https://github.com/apache/spark into shim-thriftserver1
      c359822 [wangfei] reuse getCommandProcessor in hiveshim
      52674a4 [scwf] sql/hive included since examples depend on it
      3529e98 [scwf] move hive module to hive profile
      f51ff4e [wangfei] update and fix conflicts
      f48d3a5 [scwf] Merge branch 'master' of https://github.com/apache/spark into shim-thriftserver1
      41f727b [scwf] revert pom changes
      13afde0 [scwf] fix small bug
      4b681f4 [scwf] enable thriftserver in profile hive-0.13.1
      0bc53aa [scwf] fixed when result filed is null
      dfd1c63 [scwf] update run-tests to run hive-0.12.0 default now
      c6da3ce [scwf] Merge branch 'master' of https://github.com/apache/spark into shim-thriftserver
      7c66b8e [scwf] update pom according spark-2706
      ae47489 [scwf] update and fix conflicts
    • [SPARK-4016] Allow user to show/hide UI metrics. · adb6415c
      Kay Ousterhout authored
      This commit adds a set of checkboxes to the stage detail
      page that the user can use to show additional task metrics,
      including the GC time, result serialization time, result fetch
      time, and scheduler delay.  All of these metrics are now
      hidden by default.  This allows advanced users to look at more
      detailed metrics, without distracting the average user.
      
      This change also cleans up the stage detail page so that metrics
      are shown in the same order in the summary table as in the task table,
      and updates the metrics in both tables such that they contain the same
      set of metrics.
      
      The ability to remember a user's preferences for which metrics
      should be shown has been filed as SPARK-4024.
      
      Here's what the stage detail page looks like by default:
      ![image](https://cloud.githubusercontent.com/assets/1108612/4744322/3ebe319e-5a2f-11e4-891f-c792be79caa2.png)
      
      and once a user clicks "Show additional metrics" (note that all the metrics get checked by default):
      ![image](https://cloud.githubusercontent.com/assets/1108612/4744332/51e5abda-5a2f-11e4-8994-d0d3705ee05d.png)
      
      cc shivaram andrewor14
      
      Author: Kay Ousterhout <kayousterhout@gmail.com>
      
      Closes #2867 from kayousterhout/SPARK-checkboxes and squashes the following commits:
      
      6015913 [Kay Ousterhout] Added comment
      08dee73 [Kay Ousterhout] Josh's usability comments
      0940d61 [Kay Ousterhout] Style updates based on Andrew's review
      ef05ccd [Kay Ousterhout] Added tooltips
      d7cfaaf [Kay Ousterhout] Made list of add'l metrics collapsible.
      70c1fb5 [Kay Ousterhout] [SPARK-4016] Allow user to show/hide UI metrics.
    • SPARK-3837. Warn when YARN kills containers for exceeding memory limits · acd4ac7c
      Sandy Ryza authored
      I triggered the issue and verified the message gets printed on a pseudo-distributed cluster.
      
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #2744 from sryza/sandy-spark-3837 and squashes the following commits:
      
      858a268 [Sandy Ryza] Review feedback
      c937f00 [Sandy Ryza] SPARK-3837. Warn when YARN kills containers for exceeding memory limits
    • [SPARK-4143] [SQL] Move inner class DeferredObjectAdapter to top level · 58a6077e
      Cheng Hao authored
      DeferredObjectAdapter is an inner class of HiveGenericUdf, which may cause some overhead in closure ser/de-ser. Move it to the top level.
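      The motivation in a nutshell, as a hypothetical sketch (not Spark code): an inner class holds a hidden reference to its outer instance, so serializing a closure over it drags the whole outer object along, while a top-level class serializes independently:
      ```
      // Hypothetical illustration of the overhead being avoided.
      class HeavyOuter(val buffer: Array[Byte]) extends Serializable {
        class InnerAdapter extends Serializable  // captures HeavyOuter.this
      }

      class TopLevelAdapter extends Serializable // no outer reference to carry
      ```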
      
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #3007 from chenghao-intel/move_deferred and squashes the following commits:
      
      3a139b1 [Cheng Hao] Move inner class DeferredObjectAdapter to top level
    • [SPARK-4108][SQL] Fixed usage of deprecated in sql/catalyst/types/datatypes · d31517a3
      Anant authored
      Fixed usage of deprecated in sql/catalyst/types/datatypes to have a version parameter.
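      For reference, Scala's `@deprecated` annotation accepts a message and the version since which the symbol is deprecated; a sketch of the corrected usage (the message, version, and method name below are made up):
      ```
      // Supplying both arguments avoids the compiler's complaint about
      // @deprecated being used without arguments.
      @deprecated("Use the new DataType hierarchy instead", "1.1.0")
      def oldDataTypeApi(): Unit = ()
      ```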
      
      Author: Anant <anant.asty@gmail.com>
      
      Closes #2970 from anantasty/SPARK-4108 and squashes the following commits:
      
      e92cb01 [Anant] Fixed usage of deprecated in sql/catalyst/types/datatypes to have version parameter
    • [SPARK-3250] Implement Gap Sampling optimization for random sampling · ad3bd0df
      Erik Erlandson authored
      More efficient sampling, based on Gap Sampling optimization:
      http://erikerlandson.github.io/blog/2014/09/11/faster-random-samples-with-gap-sampling/
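      The core idea, as a self-contained sketch (illustrative only, not the RandomSampler code added here): instead of drawing one uniform number per element and keeping it with probability p, draw the size of the gap to the next kept element from a geometric distribution and skip ahead:
      ```
      import scala.util.Random

      // Bernoulli(p) sampling via geometric gaps: floor(log(u)/log(1-p))
      // elements are skipped between kept elements, so only about p*n random
      // draws are needed instead of n.
      def gapSample[T](data: IndexedSeq[T], p: Double, rng: Random = new Random): Seq[T] = {
        require(p > 0.0 && p < 1.0, "p must be in (0, 1)")
        val out = scala.collection.mutable.ArrayBuffer.empty[T]
        def gap(): Int = (math.log(rng.nextDouble()) / math.log(1.0 - p)).toInt
        var i = gap()            // index of the first kept element
        while (i >= 0 && i < data.length) {
          out += data(i)
          i += 1 + gap()         // skip a geometric number of elements
        }
        out.toSeq
      }
      ```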
      
      Author: Erik Erlandson <eerlands@redhat.com>
      
      Closes #2455 from erikerlandson/spark-3250-pr and squashes the following commits:
      
      72496bc [Erik Erlandson] [SPARK-3250] Implement Gap Sampling optimization for random sampling
    • [SPARK-4124] [MLlib] [PySpark] simplify serialization in MLlib Python API · 872fc669
      Davies Liu authored
      Create several helper functions to call the MLlib Java API, convert the arguments to Java types, and convert return values to Python objects automatically. This greatly simplifies serialization in the MLlib Python API.

      After this, the MLlib Python API does not need to deal with serialization details anymore, which makes it easier to add new APIs.
      
      cc mengxr
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #2995 from davies/cleanup and squashes the following commits:
      
      8fa6ec6 [Davies Liu] address comments
      16b85a0 [Davies Liu] Merge branch 'master' of github.com:apache/spark into cleanup
      43743e5 [Davies Liu] bugfix
      731331f [Davies Liu] simplify serialization in MLlib Python API
  3. Oct 30, 2014
    • HOTFIX: Clean up build in network module. · 0734d093
      Patrick Wendell authored
      This is currently breaking the package build for some people (including me).
      
      This patch does some general clean-up which also fixes the current issue.
      - Uses consistent artifact naming
      - Adds sbt support for this module
      - Changes tests to use scalatest (fixes the original issue[1])
      
      One thing to note: it turns out that scalatest, when invoked in the Maven build, doesn't successfully detect JUnit Java tests. This is a long-standing issue; I noticed it applies to all of our current test suites as well. I've created SPARK-4159 to fix this.

      [1] The original issue is that we need to allocate extra memory for the tests, which happens by default in our scalatest configuration.
      
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #3025 from pwendell/hotfix and squashes the following commits:
      
      faa9053 [Patrick Wendell] HOTFIX: Clean up build in network module.
    • Revert "SPARK-1209 [CORE] SparkHadoop{MapRed,MapReduce}Util should not use... · 26d31d15
      Andrew Or authored
      Revert "SPARK-1209 [CORE] SparkHadoop{MapRed,MapReduce}Util should not use package org.apache.hadoop"
      
      This reverts commit 68cb69da.
    • [SPARK-3968][SQL] Use parquet-mr filter2 api · 2e35e242
      Yash Datta authored
      The parquet-mr project has introduced a new filter API (https://github.com/apache/incubator-parquet-mr/pull/4), along with several fixes. It can also eliminate entire RowGroups based on statistics such as min/max. We can leverage that to further improve the performance of queries with filters.
      The filter2 API also introduces the ability to create custom filters. We can create a custom filter for the optimized IN clause (InSet), so that elimination happens in the ParquetRecordReader itself.
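      From the user side, a sketch of turning the pushdown on; the `spark.sql.parquet.filterPushdown` key is taken from Spark's SQL configuration in later releases and is an assumption here:
      ```
      import org.apache.spark.sql.SQLContext

      // Enable Parquet filter pushdown so row groups can be skipped using
      // their min/max statistics, including for IN (InSet) predicates.
      def readFiltered(sqlContext: SQLContext) = {
        sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")
        sqlContext.sql("SELECT * FROM parquetTable WHERE id IN (1, 2, 3)")
      }
      ```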
      
      Author: Yash Datta <Yash.Datta@guavus.com>
      
      Closes #2841 from saucam/master and squashes the following commits:
      
      8282ba0 [Yash Datta] SPARK-3968: fix scala code style and add some more tests for filtering on optional columns
      515df1c [Yash Datta] SPARK-3968: Add a test case for filter pushdown on optional column
      5f4530e [Yash Datta] SPARK-3968: Fix scala code style
      f304667 [Yash Datta] SPARK-3968: Using task metadata strategy for row group filtering
      ec53e92 [Yash Datta] SPARK-3968: No push down should result in case we are unable to create a record filter
      48163c3 [Yash Datta] SPARK-3968: Code cleanup
      cc7b596 [Yash Datta] SPARK-3968: 1. Fix RowGroupFiltering not working             2. Use the serialization/deserialization from Parquet library for filter pushdown
      caed851 [Yash Datta] Revert "SPARK-3968: Not pushing the filters in case of OPTIONAL columns" since filtering on optional columns is now supported in filter2 api
      49703c9 [Yash Datta] SPARK-3968: Not pushing the filters in case of OPTIONAL columns
      9d09741 [Yash Datta] SPARK-3968: Change parquet filter pushdown to use filter2 api of parquet-mr
    • [SPARK-4120][SQL] Join of multiple tables with syntax like SELECT .. FROM... · 9b6ebe33
      ravipesala authored
      [SPARK-4120][SQL] Join of multiple tables with syntax like SELECT .. FROM T1,T2,T3.. does not work in SparkSQL
      
      Right now it works only for two tables, as in the query below:
      sql("SELECT * FROM records1 as a,records2 as b where a.key=b.key ")

      But it does not work for more than two tables, as in this query:
      sql("SELECT * FROM records1 as a,records2 as b,records3 as c where a.key=b.key and a.key=c.key")
      
      Author: ravipesala <ravindra.pesala@huawei.com>
      
      Closes #2987 from ravipesala/multijoin and squashes the following commits:
      
      429b005 [ravipesala] Support multiple joins
    • SPARK-1209 [CORE] SparkHadoop{MapRed,MapReduce}Util should not use package org.apache.hadoop · 68cb69da
      Sean Owen authored
      (This is just a look at what completely moving the classes would look like. I know Patrick flagged that as maybe not OK, although it's private?)
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #2814 from srowen/SPARK-1209 and squashes the following commits:
      
      ead1115 [Sean Owen] Disable MIMA warnings resulting from moving the class -- this was also part of the PairRDDFunctions type hierarchy though?
      2d42c1d [Sean Owen] Move SparkHadoopMapRedUtil / SparkHadoopMapReduceUtil from org.apache.hadoop to org.apache.spark
    • [SPARK-3661] Respect spark.*.memory in cluster mode · 2f545438
      Andrew Or authored
      This also includes minor re-organization of the code. Tested locally in both client and deploy modes.
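      The properties in question, as a sketch (illustrative values; in cluster mode these normally arrive via spark-defaults.conf or spark-submit --conf, since the driver JVM must be sized before it starts):
      ```
      import org.apache.spark.SparkConf

      // Illustrative only: spark.driver.memory must reach the launcher before
      // the driver JVM starts, so in practice these are set in configuration
      // files or on the spark-submit command line.
      val conf = new SparkConf()
        .set("spark.driver.memory", "4g")
        .set("spark.executor.memory", "8g")
      ```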
      
      Author: Andrew Or <andrew@databricks.com>
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #2697 from andrewor14/memory-cluster-mode and squashes the following commits:
      
      01d78bc [Andrew Or] Merge branch 'master' of github.com:apache/spark into memory-cluster-mode
      ccd468b [Andrew Or] Add some comments per Patrick
      c956577 [Andrew Or] Tweak wording
      2b4afa0 [Andrew Or] Unused import
      47a5a88 [Andrew Or] Correct Spark properties precedence order
      bf64717 [Andrew Or] Merge branch 'master' of github.com:apache/spark into memory-cluster-mode
      dd452d0 [Andrew Or] Respect spark.*.memory in cluster mode
    • [SPARK-4153][WebUI] Update the sort keys for HistoryPage · d3450578
      zsxwing authored
      Sort "Started", "Completed", "Duration" and "Last Updated" by time.
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #3014 from zsxwing/SPARK-4153 and squashes the following commits:
      
      ec8b9ad [zsxwing] Sort "Started", "Completed", "Duration" and "Last Updated" by time
    • Minor style hot fix after #2711 · 849b43ec
      Andrew Or authored
      I had planned to fix this when I merged it but I forgot to. witgo
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #3018 from andrewor14/command-utils-style and squashes the following commits:
      
      c2959fb [Andrew Or] Style hot fix
    • [SPARK-4155] Consolidate usages of <driver> · 9334d699
      Andrew Or authored
      We use "\<driver\>" everywhere. Let's not do that.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #3020 from andrewor14/consolidate-driver and squashes the following commits:
      
      c1c2204 [Andrew Or] Just use "<driver>" for local executor ID
      3d751e9 [Andrew Or] Consolidate usages of <driver>
    • [Minor] A few typos in comments and log messages · 5231a3f2
      Andrew Or authored
      Author: Andrew Or <andrewor14@gmail.com>
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #3021 from andrewor14/typos and squashes the following commits:
      
      daaf417 [Andrew Or] Merge branch 'master' of github.com:apache/spark into typos
      4838ae4 [Andrew Or] Merge branch 'master' of github.com:apache/spark into typos
      026d426 [Andrew Or] Merge branch 'master' of github.com:andrewor14/spark into typos
      a81ae8f [Andrew Or] Some typos
    • [SPARK-4138][SPARK-4139] Improve dynamic allocation settings · 26f092d4
      Andrew Or authored
      This should be merged after #2746 (SPARK-3795).
      
      **SPARK-4138**. If the user sets both the number of executors and `spark.dynamicAllocation.enabled`, we should throw an exception.
      
      **SPARK-4139**. If the user sets `spark.dynamicAllocation.enabled`, we should use the max number of executors as the starting number of executors because the first job is likely to run immediately after application startup. If the latter is not set, throw an exception.
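      A configuration sketch of the rules described above (key names as in the dynamic allocation documentation; values are illustrative):
      ```
      import org.apache.spark.SparkConf

      // With dynamic allocation enabled, do NOT also set a fixed number of
      // executors -- per SPARK-4138 that combination throws an exception.
      // The max must be set (SPARK-4139) and doubles as the starting count.
      val conf = new SparkConf()
        .set("spark.dynamicAllocation.enabled", "true")
        .set("spark.dynamicAllocation.minExecutors", "2")
        .set("spark.dynamicAllocation.maxExecutors", "64")
      ```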
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #3002 from andrewor14/yarn-set-executors and squashes the following commits:
      
      c528fce [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-set-executors
      55d4699 [Andrew Or] Bug fix: `isDynamicAllocationEnabled` was always false
      2b0ccec [Andrew Or] Start the number of executors at the max
      022bfde [Andrew Or] Guard against incompatible settings of number of executors