  1. May 12, 2015
    • Andrew Or's avatar
      [SPARK-7500] DAG visualization: move cluster labeling to dagre-d3 · 65697bbe
      Andrew Or authored
      This fixes the label bleeding issue described in the JIRA and pictured in the screenshots below. I also took the opportunity to move some code to the places where it belongs. In particular:
      
      (1) Drawing cluster labels is now implemented in my branch of dagre-d3 instead of in Spark
      (2) All graph styling is now moved from Scala to JS
      
      Note that these changes are related because our existing mechanism of "tacking on cluster labels" afterwards isn't flexible enough for us to fix issues like this one easily. For the other half of the changes, visit http://github.com/andrewor14/dagre-d3.
      
      -------------------
      
      **Before.**
      <img src="https://cloud.githubusercontent.com/assets/2133137/7582769/b1423440-f845-11e4-8248-b3446a01bf79.png" width="300px"/>
      
      -------------------
      
      **After.**
      <img src="https://cloud.githubusercontent.com/assets/2133137/7582742/74891ae6-f845-11e4-96c4-41c7b8aedbdf.png" width="400px"/>
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #6076 from andrewor14/dag-viz-bleed and squashes the following commits:
      
      5858d7a [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-bleed
      c686dc4 [Andrew Or] Fix tooltip placement
      d908c36 [Andrew Or] Add link to dagre-d3 changes (minor)
      4a4fb58 [Andrew Or] Fix bleeding + move all styling to JS
      65697bbe
    • Wenchen Fan's avatar
      [DataFrame][minor] support column in field accessor · bfcaf8ad
      Wenchen Fan authored
      Minor improvement: now we can use a `Column` as an extraction expression.
      
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      Closes #6080 from cloud-fan/tmp and squashes the following commits:
      
      0fdefb7 [Wenchen Fan] support column in field accessor
      bfcaf8ad
    • Cheng Lian's avatar
      [SPARK-3928] [SPARK-5182] [SQL] Partitioning support for the data sources API · 0595b6de
      Cheng Lian authored
      This PR adds partitioning support for the external data sources API. It aims to simplify development of file system based data sources, and provide first class partitioning support for both read path and write path.  Existing data sources like JSON and Parquet can be simplified with this work.
      
      ## New features provided
      
      1. Hive compatible partition discovery
      
         This actually generalizes the partition discovery strategy used in the Parquet data source in Spark 1.3.0.
      
      1. Generalized partition pruning optimization
      
         Now partition pruning is handled during the physical planning phase. Individual data sources don't need to worry about this anymore.
      
         (This also implies that we can remove `CatalystScan` after migrating the Parquet data source, since now we don't need to pass Catalyst expressions to data source implementations.)
      
      1. Insertion with dynamic partitions
      
         When inserting data into an `FSBasedRelation`, data can be partitioned dynamically by the specified partition columns.
      
      ## New structures provided
      
      ### Developer API
      
      1. `FSBasedRelation`
      
         Base abstract class for file system based data sources.
      
      1. `OutputWriter`
      
         Base abstract class for output row writers, responsible for writing a single row object.
      
      1. `FSBasedRelationProvider`
      
         A new relation provider for `FSBasedRelation` subclasses. Note that data sources extending `FSBasedRelation` don't need to extend `RelationProvider` and `SchemaRelationProvider`.
      
      ### User API
      
      New overloaded versions of
      
      1. `DataFrame.save()`
      1. `DataFrame.saveAsTable()`
      1. `SQLContext.load()`
      
      are provided to allow users to save/load DataFrames with user defined dynamic partition columns.
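
      As a rough, hedged sketch of what this enables (the directory layout, table path, and column names below are made up for illustration): Hive-compatible partition discovery infers partition columns from the path structure and merges them into the relation's schema, so they can be filtered on and pruned at planning time.

      ```scala
      // Hypothetical layout that Hive-style partition discovery understands:
      //   /data/events/year=2015/month=5/part-00000.parquet
      //   /data/events/year=2015/month=6/part-00000.parquet
      // `year` and `month` are inferred from the directory names and exposed
      // as ordinary columns of the loaded DataFrame.
      val events = sqlContext.load("/data/events", "parquet")

      // Only the year=2015 directories need to be scanned for this query,
      // thanks to the generalized partition pruning described above.
      events.filter(events("year") === 2015).select("month").show()
      ```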
      
      ### Spark SQL query planning
      
      1. `InsertIntoFSBasedRelation`
      
         Used to implement write path for `FSBasedRelation`s.
      
      1. New rules for `FSBasedRelation` in `DataSourceStrategy`
      
         These are added to hook `FSBasedRelation` into the physical query plan on the read path and to perform partition pruning.
      
      ## TODO
      
      - [ ] Use scratch directories when overwriting a table with data selected from itself.
      
            Currently, this is not supported, because the table being overwritten is always deleted before writing any data to it.
      
      - [ ] When inserting with dynamic partition columns, use external sorter to group the data first.
      
            This ensures that we only need to open a single `OutputWriter` at a time.  For data sources like Parquet, `OutputWriter`s can be quite memory-consuming.  One issue is that this approach breaks the row distribution in the original DataFrame.  However, we didn't promise to preserve data distribution when writing a DataFrame.
      
      - [x] More tests.  Specifically, test cases for
      
            - [x] Self-join
            - [x] Loading partitioned relations with a subset of partition columns stored in data files.
            - [x] `SQLContext.load()` with user defined dynamic partition columns.
      
      ## Parquet data source migration
      
      Parquet data source migration is covered in PR https://github.com/liancheng/spark/pull/6, which is against this PR branch and for preview only. A formal PR needs to be made after this one is merged.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #5526 from liancheng/partitioning-support and squashes the following commits:
      
      5351a1b [Cheng Lian] Fixes compilation error introduced while rebasing
      1f9b1a5 [Cheng Lian] Tweaks data schema passed to FSBasedRelations
      43ba50e [Cheng Lian] Avoids serializing generated projection code
      edf49e7 [Cheng Lian] Removed commented stale code block
      348a922 [Cheng Lian] Adds projection in FSBasedRelation.buildScan(requiredColumns, inputPaths)
      ad4d4de [Cheng Lian] Enables HDFS style globbing
      8d12e69 [Cheng Lian] Fixes compilation error
      c71ac6c [Cheng Lian] Addresses comments from @marmbrus
      7552168 [Cheng Lian] Fixes typo in MimaExclude.scala
      0349e09 [Cheng Lian] Fixes compilation error introduced while rebasing
      52b0c9b [Cheng Lian] Adjusts project/MimaExclude.scala
      c466de6 [Cheng Lian] Addresses comments
      bc3f9b4 [Cheng Lian] Uses projection to separate partition columns and data columns while inserting rows
      795920a [Cheng Lian] Fixes compilation error after rebasing
      0b8cd70 [Cheng Lian] Adds Scala/Catalyst row conversion when writing non-partitioned tables
      fa543f3 [Cheng Lian] Addresses comments
      5849dd0 [Cheng Lian] Fixes doc typos.  Fixes partition discovery refresh.
      51be443 [Cheng Lian] Replaces FSBasedRelation.outputCommitterClass with FSBasedRelation.prepareForWrite
      c4ed4fe [Cheng Lian] Bug fixes and a new test suite
      a29e663 [Cheng Lian] Bug fix: should only pass actuall data files to FSBaseRelation.buildScan
      5f423d3 [Cheng Lian] Bug fixes. Lets data source to customize OutputCommitter rather than OutputFormat
      54c3d7b [Cheng Lian] Enforces that FileOutputFormat must be used
      be0c268 [Cheng Lian] Uses TaskAttempContext rather than Configuration in OutputWriter.init
      0bc6ad1 [Cheng Lian] Resorts to new Hadoop API, and now FSBasedRelation can customize output format class
      f320766 [Cheng Lian] Adds prepareForWrite() hook, refactored writer containers
      422ff4a [Cheng Lian] Fixes style issue
      ce52353 [Cheng Lian] Adds new SQLContext.load() overload with user defined dynamic partition columns
      8d2ff71 [Cheng Lian] Merges partition columns when reading partitioned relations
      ca1805b [Cheng Lian] Removes duplicated partition discovery code in new Parquet
      f18dec2 [Cheng Lian] More strict schema checking
      b746ab5 [Cheng Lian] More tests
      9b487bf [Cheng Lian] Fixes compilation errors introduced while rebasing
      ea6c8dd [Cheng Lian] Removes remote debugging stuff
      327bb1d [Cheng Lian] Implements partitioning support for data sources API
      3c5073a [Cheng Lian] Fixes SaveModes used in test cases
      fb5a607 [Cheng Lian] Fixes compilation error
      9d17607 [Cheng Lian] Adds the contract that OutputWriter should have zero-arg constructor
      5de194a [Cheng Lian] Forgot Apache licence header
      95d0b4d [Cheng Lian] Renames PartitionedSchemaRelationProvider to FSBasedRelationProvider
      770b5ba [Cheng Lian] Adds tests for FSBasedRelation
      3ba9bbf [Cheng Lian] Adds DataFrame.saveAsTable() overrides which support partitioning
      1b8231f [Cheng Lian] Renames FSBasedPrunedFilteredScan to FSBasedRelation
      aa8ba9a [Cheng Lian] Javadoc fix
      012ed2d [Cheng Lian] Adds PartitioningOptions
      7dd8dd5 [Cheng Lian] Adds new interfaces and stub methods for data sources API partitioning support
      0595b6de
    • Wenchen Fan's avatar
      [DataFrame][minor] cleanup unapply methods in DataTypes · 831504cf
      Wenchen Fan authored
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      Closes #6079 from cloud-fan/unapply and squashes the following commits:
      
      40da442 [Wenchen Fan] one more
      7d90a05 [Wenchen Fan] cleanup unapply in DataTypes
      831504cf
    • Daoyuan Wang's avatar
      [SPARK-6876] [PySpark] [SQL] add DataFrame na.replace in pyspark · d86ce845
      Daoyuan Wang authored
      Author: Daoyuan Wang <daoyuan.wang@intel.com>
      
      Closes #6003 from adrian-wang/pynareplace and squashes the following commits:
      
      672efba [Daoyuan Wang] remove py2.7 feature
      4a148f7 [Daoyuan Wang] to_replace support dict, value support single value, and add full tests
      9e232e7 [Daoyuan Wang] rename scala map
      af0268a [Daoyuan Wang] remove na
      63ac579 [Daoyuan Wang] add na.replace in pyspark
      d86ce845
    • Tathagata Das's avatar
      [SPARK-7532] [STREAMING] StreamingContext.start() made to logWarning and not throw exception · ec6f2a97
      Tathagata Das authored
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #6060 from tdas/SPARK-7532 and squashes the following commits:
      
      6fe2e83 [Tathagata Das] Update docs
      7dadfc3 [Tathagata Das] Fixed bug again
      99c7678 [Tathagata Das] Added logInfo
      65aec20 [Tathagata Das] Fix bug
      5bf031b [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into SPARK-7532
      1a9a818 [Tathagata Das] Fix scaladoc
      c584313 [Tathagata Das] StreamingContext.start() made to logWarning and not throw exception
      ec6f2a97
    • Andrew Or's avatar
      [SPARK-7467] Dag visualization: treat checkpoint as an RDD operation · f3e8e600
      Andrew Or authored
      This ensures that a checkpoint RDD does not go into random scopes on the UI, e.g. `take`. We've seen this in streaming.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #6004 from andrewor14/dag-viz-checkpoint and squashes the following commits:
      
      9217439 [Andrew Or] Fix checkpoints
      4ae8806 [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-checkpoint
      19bc07b [Andrew Or] Treat checkpoint as an RDD operation
      f3e8e600
    • Marcelo Vanzin's avatar
      [SPARK-7485] [BUILD] Remove pyspark files from assembly. · 82e890fb
      Marcelo Vanzin authored
      The sbt part of the build is hacky; it basically tricks sbt
      into generating the zip by using a generator, but returns
      an empty list for the generated files so that nothing is
      actually added to the assembly.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #6022 from vanzin/SPARK-7485 and squashes the following commits:
      
      22c1e04 [Marcelo Vanzin] Remove unneeded code.
      4893622 [Marcelo Vanzin] [SPARK-7485] [build] Remove pyspark files from assembly.
      82e890fb
    • linweizhong's avatar
      [MINOR] [PYSPARK] Set PYTHONPATH to python/lib/pyspark.zip rather than python/pyspark · 98478752
      linweizhong authored
      As in PR #5580 we now create pyspark.zip during the build and set PYTHONPATH to python/lib/pyspark.zip, so update this to keep things consistent.
      
      Author: linweizhong <linweizhong@huawei.com>
      
      Closes #6047 from Sephiroth-Lin/pyspark_pythonpath and squashes the following commits:
      
      8cc3d96 [linweizhong] Set PYTHONPATH to python/lib/pyspark.zip rather than python/pyspark as PR#5580 we have create pyspark.zip on build
      98478752
    • zsxwing's avatar
      [SPARK-7534] [CORE] [WEBUI] Fix the Stage table when a stage is missing · 8a4edecc
      zsxwing authored
      Just improved the Stage table when a stage is missing.
      
      Before:
      
      ![screen shot 2015-05-11 at 10 11 51 am](https://cloud.githubusercontent.com/assets/1000778/7570842/2ba37380-f7c8-11e4-9b5f-cf1a6264b2a4.png)
      
      After:
      
      ![screen shot 2015-05-11 at 10 26 09 am](https://cloud.githubusercontent.com/assets/1000778/7570848/33703152-f7c8-11e4-81a8-d53dd72d7b8d.png)
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #6061 from zsxwing/SPARK-7534 and squashes the following commits:
      
      09fe862 [zsxwing] Leave it blank rather than '-'
      6299197 [zsxwing] Fix the Stage table when a stage is missing
      8a4edecc
    • vidmantas zemleris's avatar
      [SPARK-6994][SQL] Update docs for fetching Row fields by name · 640f63b9
      vidmantas zemleris authored
      add docs for https://issues.apache.org/jira/browse/SPARK-6994
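
      A brief, hedged sketch of the name-based access these docs describe (column names are made up; this assumes the `Row.getAs[T](fieldName)` accessor added under SPARK-6994):

      ```scala
      // Assume a DataFrame `people` with columns `id: Int` and `name: String`.
      val first = people.head()                  // an org.apache.spark.sql.Row
      val id    = first.getAs[Int]("id")         // fetch fields by name ...
      val name  = first.getAs[String]("name")    // ... instead of by position
      ```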
      
      Author: vidmantas zemleris <vidmantas@vinted.com>
      
      Closes #6030 from vidma/docs/row-with-named-fields and squashes the following commits:
      
      241b401 [vidmantas zemleris] [SPARK-6994][SQL] Update docs for fetching Row fields by name
      640f63b9
    • Reynold Xin's avatar
      [SQL] Rename Dialect -> ParserDialect. · 16696759
      Reynold Xin authored
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #6071 from rxin/parserdialect and squashes the following commits:
      
      ca2eb31 [Reynold Xin] Rename Dialect -> ParserDialect.
      16696759
  2. May 11, 2015
    • Joshi's avatar
      [SPARK-7435] [SPARKR] Make DataFrame.show() consistent with that of Scala and pySpark · b94a9337
      Joshi authored
      Author: Joshi <rekhajoshm@gmail.com>
      Author: Rekha Joshi <rekhajoshm@gmail.com>
      
      Closes #5989 from rekhajoshm/fix/SPARK-7435 and squashes the following commits:
      
      cfc9e02 [Joshi] Spark-7435[R]: updated patch for review comments
      62becc1 [Joshi] SPARK-7435: Update to DataFrame
      e3677c9 [Rekha Joshi] Merge pull request #1 from apache/master
      b94a9337
    • Reynold Xin's avatar
      [SPARK-7509][SQL] DataFrame.drop in Python for dropping columns. · 028ad4bd
      Reynold Xin authored
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #6068 from rxin/drop-column and squashes the following commits:
      
      9d7d5ec [Reynold Xin] [SPARK-7509][SQL] DataFrame.drop in Python for dropping columns.
      028ad4bd
    • Zhongshuai Pei's avatar
      [SPARK-7437] [SQL] Fold "literal in (item1, item2, ..., literal, ...)" into true or false directly · 4b5e1fe9
      Zhongshuai Pei authored
      SQL
      ```sql
      select key from src where 3 in (4, 5);
      ```
      Before
      ```
      == Optimized Logical Plan ==
      Project [key#12]
       Filter 3 INSET (5,4)
        MetastoreRelation default, src, None
      ```
      
      After
      ```
      == Optimized Logical Plan ==
      LocalRelation [key#228], []
      ```
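
      A toy illustration of the idea (not the actual Catalyst rule): when the tested value and every element of the IN list are literals, the predicate can be evaluated once at optimization time, and a false result lets the whole Filter collapse into an empty LocalRelation.

      ```scala
      // Hypothetical sketch of the folding decision, for illustration only.
      def foldLiteralIn(value: Any, list: Seq[Any]): Boolean = list.contains(value)

      foldLiteralIn(3, Seq(4, 5))   // false, so `where 3 in (4, 5)` returns no rows
      ```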
      
      Author: Zhongshuai Pei <799203320@qq.com>
      Author: DoingDone9 <799203320@qq.com>
      
      Closes #5972 from DoingDone9/InToFalse and squashes the following commits:
      
      4c722a2 [Zhongshuai Pei] Update predicates.scala
      abe2bbb [Zhongshuai Pei] Update Optimizer.scala
      fa461a5 [Zhongshuai Pei] Update Optimizer.scala
      e34c28a [Zhongshuai Pei] Update predicates.scala
      24739bd [Zhongshuai Pei] Update ConstantFoldingSuite.scala
      f4dbf50 [Zhongshuai Pei] Update ConstantFoldingSuite.scala
      35ceb7a [Zhongshuai Pei] Update Optimizer.scala
      36c194e [Zhongshuai Pei] Update Optimizer.scala
      2e8f6ca [Zhongshuai Pei] Update Optimizer.scala
      14952e2 [Zhongshuai Pei] Merge pull request #13 from apache/master
      f03fe7f [Zhongshuai Pei] Merge pull request #12 from apache/master
      f12fa50 [Zhongshuai Pei] Merge pull request #10 from apache/master
      f61210c [Zhongshuai Pei] Merge pull request #9 from apache/master
      34b1a9a [Zhongshuai Pei] Merge pull request #8 from apache/master
      802261c [DoingDone9] Merge pull request #7 from apache/master
      d00303b [DoingDone9] Merge pull request #6 from apache/master
      98b134f [DoingDone9] Merge pull request #5 from apache/master
      161cae3 [DoingDone9] Merge pull request #4 from apache/master
      c87e8b6 [DoingDone9] Merge pull request #3 from apache/master
      cb1852d [DoingDone9] Merge pull request #2 from apache/master
      c3f046f [DoingDone9] Merge pull request #1 from apache/master
      4b5e1fe9
    • Cheng Hao's avatar
      [SPARK-7411] [SQL] Support SerDe for HiveQl in CTAS · e35d878b
      Cheng Hao authored
      This is a follow up of #5876 and should be merged after #5876.
      
      Let's wait for unit testing result from Jenkins.
      
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #5963 from chenghao-intel/useIsolatedClient and squashes the following commits:
      
      f87ace6 [Cheng Hao] remove the TODO and add `resolved condition` for HiveTable
      a8260e8 [Cheng Hao] Update code as feedback
      f4e243f [Cheng Hao] remove the serde setting for SequenceFile
      d166afa [Cheng Hao] style issue
      d25a4aa [Cheng Hao] Add SerDe support for CTAS
      e35d878b
    • Reynold Xin's avatar
      [SPARK-7324] [SQL] DataFrame.dropDuplicates · b6bf4f76
      Reynold Xin authored
      This should also close https://github.com/apache/spark/pull/5870
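
      A minimal usage sketch (assuming a DataFrame `df`; the column name is made up):

      ```scala
      val distinctRows  = df.dropDuplicates()            // consider all columns
      val distinctByKey = df.dropDuplicates(Seq("id"))   // consider only a subset of columns
      ```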
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #6066 from rxin/dropDups and squashes the following commits:
      
      130692f [Reynold Xin] [SPARK-7324][SQL] DataFrame.dropDuplicates
      b6bf4f76
    • Tathagata Das's avatar
      [SPARK-7530] [STREAMING] Added StreamingContext.getState() to expose the... · f9c7580a
      Tathagata Das authored
      [SPARK-7530] [STREAMING] Added StreamingContext.getState() to expose the current state of the context
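
      A hedged sketch of the new API (assuming a StreamingContext `ssc` and the state names mentioned in the commits below, e.g. ACTIVE):

      ```scala
      import org.apache.spark.streaming.StreamingContextState

      // The context moves from INITIALIZED to ACTIVE after start(),
      // and to STOPPED after stop().
      if (ssc.getState() == StreamingContextState.INITIALIZED) {
        ssc.start()
      }
      ```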
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #6058 from tdas/SPARK-7530 and squashes the following commits:
      
      80ee0e6 [Tathagata Das] STARTED --> ACTIVE
      3da6547 [Tathagata Das] Added synchronized
      dd88444 [Tathagata Das] Added more docs
      e1a8505 [Tathagata Das] Fixed comment length
      89f9980 [Tathagata Das] Change to Java enum and added Java test
      7c57351 [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into SPARK-7530
      dd4e702 [Tathagata Das] Addressed comments.
      3d56106 [Tathagata Das] Added Mima excludes
      2b86ba1 [Tathagata Das] Added scala docs.
      1722433 [Tathagata Das] Fixed style
      976b094 [Tathagata Das] Added license
      0585130 [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into SPARK-7530
      e0f0a05 [Tathagata Das] Added getState and exposed StreamingContextState
      f9c7580a
    • Xusen Yin's avatar
      [SPARK-5893] [ML] Add bucketizer · 35fb42a0
      Xusen Yin authored
      JIRA issue [here](https://issues.apache.org/jira/browse/SPARK-5893).
      
      One thing to make clear: the `buckets` parameter, which is an array of `Double`, acts as the split points. Say,
      
      ```scala
      buckets = Array(-0.5, 0.0, 0.5)
      ```
      
      splits the real number line into 4 ranges, (-inf, -0.5], (-0.5, 0.0], (0.0, 0.5], (0.5, +inf), which are encoded as 0, 1, 2, 3.
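
      For concreteness, a hedged sketch of the splits-based API this PR converged on (per the commit notes below, lowerInclusive/upperInclusive were replaced by an explicit `splits` array; column names here are made up):

      ```scala
      import org.apache.spark.ml.feature.Bucketizer

      val bucketizer = new Bucketizer()
        .setInputCol("value")
        .setOutputCol("bucket")
        .setSplits(Array(Double.NegativeInfinity, -0.5, 0.0, 0.5, Double.PositiveInfinity))

      // bucketizer.transform(df) adds a `bucket` column with values 0 through 3.
      ```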
      
      Author: Xusen Yin <yinxusen@gmail.com>
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #5980 from yinxusen/SPARK-5893 and squashes the following commits:
      
      dc8c843 [Xusen Yin] Merge pull request #4 from jkbradley/yinxusen-SPARK-5893
      1ca973a [Joseph K. Bradley] one more bucketizer test
      34f124a [Joseph K. Bradley] Removed lowerInclusive, upperInclusive params from Bucketizer, and used splits instead.
      eacfcfa [Xusen Yin] change ML attribute from splits into buckets
      c3cc770 [Xusen Yin] add more unit test for binary search
      3a16cc2 [Xusen Yin] refine comments and names
      ac77859 [Xusen Yin] fix style error
      fb30d79 [Xusen Yin] fix and test binary search
      2466322 [Xusen Yin] refactor Bucketizer
      11fb00a [Xusen Yin] change it into an Estimator
      998bc87 [Xusen Yin] check buckets
      4024cf1 [Xusen Yin] add test suite
      5fe190e [Xusen Yin] add bucketizer
      35fb42a0
    • Reynold Xin's avatar
      Updated DataFrame.saveAsTable Hive warning to include SPARK-7550 ticket. · 87229c95
      Reynold Xin authored
      So users that are interested in this can track it easily.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #6067 from rxin/SPARK-7550 and squashes the following commits:
      
      ee0e34c [Reynold Xin] Updated DataFrame.saveAsTable Hive warning to include SPARK-7550 ticket.
      87229c95
    • Reynold Xin's avatar
      [SPARK-7462][SQL] Update documentation for retaining grouping columns in DataFrames. · 3a9b6997
      Reynold Xin authored
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #6062 from rxin/agg-retain-doc and squashes the following commits:
      
      43e511e [Reynold Xin] [SPARK-7462][SQL] Update documentation for retaining grouping columns in DataFrames.
      3a9b6997
    • madhukar's avatar
      [SPARK-7084] improve saveAsTable documentation · 57255dcd
      madhukar authored
      Author: madhukar <phatak.dev@gmail.com>
      
      Closes #5654 from phatak-dev/master and squashes the following commits:
      
      386f407 [madhukar] #5654 updated for all the methods
      2c997c5 [madhukar] Merge branch 'master' of https://github.com/apache/spark
      00bc819 [madhukar] Merge branch 'master' of https://github.com/apache/spark
      2a802c6 [madhukar] #5654 updated the doc according to comments
      866e8df [madhukar] [SPARK-7084] improve saveAsTable documentation
      57255dcd
    • Reynold Xin's avatar
      [SQL] Show better error messages for incorrect join types in DataFrames. · 4f4dbb03
      Reynold Xin authored
      As a follow-up to https://github.com/apache/spark/pull/5944
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #6064 from rxin/jointype-better-error and squashes the following commits:
      
      7629bf7 [Reynold Xin] [SQL] Show better error messages for incorrect join types in DataFrames.
      4f4dbb03
    • Sean Owen's avatar
      [MINOR] [DOCS] Fix the link to test building info on the wiki · 91dc3dfd
      Sean Owen authored
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #6063 from srowen/FixRunningTestsLink and squashes the following commits:
      
      db62018 [Sean Owen] Fix the link to test building info on the wiki
      91dc3dfd
    • LCY Vincent's avatar
      Update Documentation: leftsemi instead of semijoin · a8ea0968
      LCY Vincent authored
      Should this sync up with the join type names defined here?
      https://github.com/apache/spark/blob/119f45d61d7b48d376cca05e1b4f0c7fcf65bfa8/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/joinTypes.scala#L26
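
      For reference, a minimal sketch of the documented join type string (the DataFrame names and join condition are made up):

      ```scala
      // "leftsemi" keeps only the rows of `orders` that have a match in `customers`.
      val matched = orders.join(customers, orders("customerId") === customers("id"), "leftsemi")
      ```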
      
      Author: LCY Vincent <lauchunyin@gmail.com>
      
      Closes #5944 from vincentlaucy/master and squashes the following commits:
      
      fc0e454 [LCY Vincent] Update DataFrame.scala
      a8ea0968
    • jerryshao's avatar
      [STREAMING] [MINOR] Close files correctly when iterator is finished in streaming WAL recovery · 25c01c54
      jerryshao authored
      Currently there is no way to close the file correctly after the iteration is finished, so change to a `CompletionIterator` to avoid resource leakage.
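
      A rough sketch of the pattern (CompletionIterator is an internal Spark utility, so the exact signature and the reader/iterator names here are assumptions):

      ```scala
      import org.apache.spark.util.CompletionIterator

      // The completion callback runs once the wrapped iterator is exhausted,
      // which is where the underlying WAL file gets closed.
      val records = CompletionIterator[Array[Byte], Iterator[Array[Byte]]](
        rawIterator, reader.close())
      ```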
      
      Author: jerryshao <saisai.shao@intel.com>
      
      Closes #6050 from jerryshao/close-file-correctly and squashes the following commits:
      
      52dfaf5 [jerryshao] Close files correctly when iterator is finished
      25c01c54
    • gchen's avatar
      [SPARK-7516] [Minor] [DOC] Replace depreciated inferSchema() with createDataFrame() · 8e674331
      gchen authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-7516
      
      In sql-programming-guide, the deprecated Python DataFrame API inferSchema() should be replaced by createDataFrame():
      
      schemaPeople = sqlContext.inferSchema(people) ->
      schemaPeople = sqlContext.createDataFrame(people)
      
      Author: gchen <chenguancheng@gmail.com>
      
      Closes #6041 from gchen/python-docs and squashes the following commits:
      
      c27eb7c [gchen] replace inferSchema() with createDataFrame()
      8e674331
    • Kousuke Saruta's avatar
      [SPARK-7515] [DOC] Update documentation for PySpark on YARN with cluster mode · 6e9910c2
      Kousuke Saruta authored
      Now that PySpark on YARN with cluster mode is supported, let's update the doc.
      
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #6040 from sarutak/update-doc-for-pyspark-on-yarn and squashes the following commits:
      
      ad9f88c [Kousuke Saruta] Brushed up sentences
      469fd2e [Kousuke Saruta] Merge branch 'master' of https://github.com/apache/spark into update-doc-for-pyspark-on-yarn
      fcfdb92 [Kousuke Saruta] Updated doc for PySpark on YARN with cluster mode
      6e9910c2
    • Steve Loughran's avatar
      [SPARK-7508] JettyUtils-generated servlets to log & report all errors · 7ce2a33c
      Steve Loughran authored
      Patch for SPARK-7508
      
      This logs a warning, then generates a response that includes the message body and stack trace as text/plain, no-cache. The response code is 500.

      In practice (in some tests in SPARK-1537, to be precise), Jetty is getting in between this servlet and the web response the user sees: the body of the response is lost for any error response (500, even 404 and bad request). The standard Jetty handlers must be getting in the way.

      This patch doesn't address that; it ensures that
      1. if the Jetty handlers were put to one side, the users would see the errors
      2. at least the exceptions appear in the server-side logs.

      This is better than users saying "I saw a 500 error" while you have nothing in the logs to see what went wrong.
      
      Author: Steve Loughran <stevel@hortonworks.com>
      
      Closes #6033 from steveloughran/stevel/feature/SPARK-7508-JettyUtils and squashes the following commits:
      
      584836f [Steve Loughran] SPARK-7508 drop trailing semicolon
      ad6f185 [Steve Loughran] SPARK-7508: jetty handles exception reporting itself; spark just sets this up and logs exceptions before being relayed
      258d9f9 [Steve Loughran] SPARK-7508 fix typo manually-edited before patch pushed
      69c8263 [Steve Loughran] SPARK-7508 JettyUtils-generated servlets to log & report all errors
      7ce2a33c
    • Sandy Ryza's avatar
      [SPARK-6470] [YARN] Add support for YARN node labels. · 82fee9d9
      Sandy Ryza authored
      It is difficult to write a test for this because it relies on the latest version of YARN, but I verified manually that the patch does pass along the label expression on this version and that containers are successfully launched.
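
      A hedged configuration sketch (the exact property key is my assumption, based on later Spark-on-YARN documentation for this feature, and the label value is made up):

      ```scala
      import org.apache.spark.SparkConf

      // Ask YARN to place executors only on nodes carrying the "gpu" label.
      val conf = new SparkConf()
        .set("spark.yarn.executor.nodeLabelExpression", "gpu")
      ```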
      
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #5242 from sryza/sandy-spark-6470 and squashes the following commits:
      
      6af87b9 [Sandy Ryza] Change info to warning
      6e22d99 [Sandy Ryza] [YARN] SPARK-6470.  Add support for YARN node labels.
      82fee9d9
    • Reynold Xin's avatar
      [SPARK-7462] By default retain group by columns in aggregate · 0a4844f9
      Reynold Xin authored
      Updated Java, Scala, Python, and R.
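
      An illustration of the new default behavior (column names are made up): the grouping column is now retained in the result.

      ```scala
      import org.apache.spark.sql.functions.max

      // Result schema is (department, max(age)) rather than just (max(age)).
      val result = df.groupBy("department").agg(max("age"))
      ```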
      
      Author: Reynold Xin <rxin@databricks.com>
      Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
      
      Closes #5996 from rxin/groupby-retain and squashes the following commits:
      
      aac7119 [Reynold Xin] Merge branch 'groupby-retain' of github.com:rxin/spark into groupby-retain
      f6858f6 [Reynold Xin] Merge branch 'master' into groupby-retain
      5f923c0 [Reynold Xin] Merge pull request #15 from shivaram/sparkr-groupby-retrain
      c1de670 [Shivaram Venkataraman] Revert workaround in SparkR to retain grouped cols Based on reverting code added in commit https://github.com/amplab-extras/spark/commit/9a6be746efc9fafad88122fa2267862ef87aa0e1
      b8b87e1 [Reynold Xin] Fixed DataFrameJoinSuite.
      d910141 [Reynold Xin] Updated rest of the files
      1e6e666 [Reynold Xin] [SPARK-7462] By default retain group by columns in aggregate
      0a4844f9
    • Tathagata Das's avatar
      [SPARK-7361] [STREAMING] Throw unambiguous exception when attempting to start... · 1b465569
      Tathagata Das authored
      [SPARK-7361] [STREAMING] Throw unambiguous exception when attempting to start multiple StreamingContexts in the same JVM
      
      Currently, attempting to start a StreamingContext while another one is already started throws a confusing exception saying that the name JobScheduler is already registered. Instead it is best to throw a proper exception, as this is not supported.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #5907 from tdas/SPARK-7361 and squashes the following commits:
      
      fb81c4a [Tathagata Das] Fix typo
      a9cd5bb [Tathagata Das] Added startSite to StreamingContext
      5fdfc0d [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into SPARK-7361
      5870e2b [Tathagata Das] Added check for multiple streaming contexts
      1b465569
    • Bryan Cutler's avatar
      [SPARK-7522] [EXAMPLES] Removed angle brackets from dataFormat option · 4f8a1551
      Bryan Cutler authored
      As is, to specify this option on the command line, you have to escape the angle brackets.
      
      Author: Bryan Cutler <bjcutler@us.ibm.com>
      
      Closes #6049 from BryanCutler/dataFormat-option-7522 and squashes the following commits:
      
      b34afb4 [Bryan Cutler] [SPARK-7522] Removed angle brackets from dataFormat option
      4f8a1551
    • Yanbo Liang's avatar
      [SPARK-6092] [MLLIB] Add RankingMetrics in PySpark/MLlib · 042dda3c
      Yanbo Liang authored
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #6044 from yanboliang/spark-6092 and squashes the following commits:
      
      726a9b1 [Yanbo Liang] add newRankingMetrics
      33f649c [Yanbo Liang] Add RankingMetrics in PySpark/MLlib
      042dda3c
    • Wesley Miao's avatar
      [SPARK-7326] [STREAMING] Performing window() on a WindowedDStream doesn't work all the time · d70a0768
      Wesley Miao authored
      tdas
      
      https://issues.apache.org/jira/browse/SPARK-7326
      
      The problem most likely resides in the DStream.slice() implementation, as shown below.
      
        def slice(fromTime: Time, toTime: Time): Seq[RDD[T]] = {
          if (!isInitialized) {
            throw new SparkException(this + " has not been initialized")
          }
          if (!(fromTime - zeroTime).isMultipleOf(slideDuration)) {
            logWarning("fromTime (" + fromTime + ") is not a multiple of slideDuration ("
              + slideDuration + ")")
          }
          if (!(toTime - zeroTime).isMultipleOf(slideDuration)) {
            logWarning("toTime (" + fromTime + ") is not a multiple of slideDuration ("
              + slideDuration + ")")
          }
          val alignedToTime = toTime.floor(slideDuration, zeroTime)
          val alignedFromTime = fromTime.floor(slideDuration, zeroTime)
      
          logInfo("Slicing from " + fromTime + " to " + toTime +
            " (aligned to " + alignedFromTime + " and " + alignedToTime + ")")
      
          alignedFromTime.to(alignedToTime, slideDuration).flatMap(time => {
            if (time >= zeroTime) getOrCompute(time) else None
          })
        }
      
      Here, after performing floor() on both fromTime and toTime, the results (alignedFromTime - zeroTime) and (alignedToTime - zeroTime) may no longer be multiples of the slideDuration, thus making the isTimeValid() check fail for all the remaining computation.
      
      The fix is to add a new floor() function in Time.scala that respects the zeroTime while performing the floor:
      
        def floor(that: Duration, zeroTime: Time): Time = {
          val t = that.milliseconds
          new Time(((this.millis - zeroTime.milliseconds) / t) * t + zeroTime.milliseconds)
        }
      
      And then change the DStream.slice to call this new floor function by passing in its zeroTime.
      
          val alignedToTime = toTime.floor(slideDuration, zeroTime)
          val alignedFromTime = fromTime.floor(slideDuration, zeroTime)
      
      This way the alignedToTime and alignedFromTime are *really* aligned with respect to zeroTime, whose value is not necessarily 0.
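
      A small worked example of the zeroTime-aware floor (numbers chosen for illustration):

      ```scala
      // Same formula as the proposed Time.floor, on raw milliseconds.
      def floorMs(millis: Long, slideMs: Long, zeroMs: Long): Long =
        ((millis - zeroMs) / slideMs) * slideMs + zeroMs

      // zeroTime = 1000 ms, slideDuration = 3000 ms, toTime = 8200 ms
      floorMs(8200L, 3000L, 1000L)   // 7000; 7000 - 1000 = 6000 is a multiple of 3000
      ```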
      
      Author: Wesley Miao <wesley.miao@gmail.com>
      Author: Wesley <wesley.miao@autodesk.com>
      
      Closes #5871 from wesleymiao/spark-7326 and squashes the following commits:
      
      82a4d8c [Wesley Miao] [SPARK-7326] [STREAMING] Performing window() on a WindowedDStream dosen't work all the time
      48b4dc0 [Wesley] [SPARK-7326] [STREAMING] Performing window() on a WindowedDStream doesn't work all the time
      6ade399 [Wesley] [SPARK-7326] [STREAMING] Performing window() on a WindowedDStream doesn't work all the time
      2611745 [Wesley Miao] [SPARK-7326] [STREAMING] Performing window() on a WindowedDStream doesn't work all the time
      d70a0768
    • tianyi's avatar
      [SPARK-7519] [SQL] fix minor bugs in thrift server UI · 2242ab31
      tianyi authored
      Bug descriptions:

      1. There are extra commas at the top of the session list.
      2. The time format in the "Start at:" field is not the same as the others.
      3. The total number of online sessions is wrong.
      
      Author: tianyi <tianyi.asiainfo@gmail.com>
      
      Closes #6048 from tianyi/SPARK-7519 and squashes the following commits:
      
      ed366b7 [tianyi] fix bug
      2242ab31
  3. May 10, 2015
    • Shivaram Venkataraman's avatar
      [SPARK-7512] [SPARKR] Fix RDD's show method to use getJRDD · 0835f1ed
      Shivaram Venkataraman authored
      Since the RDD object might be a PipelinedRDD, we should use `getJRDD` to get the right handle to the Java object.
      
      Fixes the bug reported at
      http://stackoverflow.com/questions/30057702/sparkr-filterrdd-and-flatmap-not-working
      
      cc concretevitamin
      
      Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
      
      Closes #6035 from shivaram/sparkr-show-bug and squashes the following commits:
      
      d70145c [Shivaram Venkataraman] Fix RDD's show method to use getJRDD Fixes the bug reported at http://stackoverflow.com/questions/30057702/sparkr-filterrdd-and-flatmap-not-working
      0835f1ed
    • Glenn Weidner's avatar
      [SPARK-7427] [PYSPARK] Make sharedParams match in Scala, Python · c5aca0c2
      Glenn Weidner authored
      Modified 2 files:
      python/pyspark/ml/param/_shared_params_code_gen.py
      python/pyspark/ml/param/shared.py
      
      Generated shared.py on Linux using Python 2.6.6 on Redhat Enterprise Linux Server 6.6.
      python _shared_params_code_gen.py > shared.py
      
      Only changed maxIter, regParam, rawPredictionCol based on strings from SharedParamsCodeGen.scala.  Note that a warning was displayed when committing shared.py:
      warning: LF will be replaced by CRLF in python/pyspark/ml/param/shared.py.
      
      Author: Glenn Weidner <gweidner@us.ibm.com>
      
      Closes #6023 from gweidner/br-7427 and squashes the following commits:
      
      db72e32 [Glenn Weidner] [SPARK-7427] [PySpark] Make sharedParams match in Scala, Python
      825e4a9 [Glenn Weidner] [SPARK-7427] [PySpark] Make sharedParams match in Scala, Python
      e6a865e [Glenn Weidner] [SPARK-7427] [PySpark] Make sharedParams match in Scala, Python
      1eee702 [Glenn Weidner] Merge remote-tracking branch 'upstream/master'
      1ac10e5 [Glenn Weidner] Merge remote-tracking branch 'upstream/master'
      cafd104 [Glenn Weidner] Merge remote-tracking branch 'upstream/master'
      9bea1eb [Glenn Weidner] Merge remote-tracking branch 'upstream/master'
      4a35c20 [Glenn Weidner] Merge remote-tracking branch 'upstream/master'
      9790cbe [Glenn Weidner] Merge remote-tracking branch 'upstream/master'
      d9c30f4 [Glenn Weidner] [SPARK-7275] [SQL] [WIP] Make LogicalRelation public
      c5aca0c2
    • Kirill A. Korinskiy's avatar
      [SPARK-5521] PCA wrapper for easy transform vectors · 8c07c75c
      Kirill A. Korinskiy authored
      I implement a simple PCA wrapper for easily transforming vectors with PCA, for example inside a LabeledPoint or another more complicated structure.
      
      Example of usage:
      ```scala
        import org.apache.spark.mllib.regression.LinearRegressionWithSGD
        import org.apache.spark.mllib.regression.LabeledPoint
        import org.apache.spark.mllib.linalg.Vectors
        import org.apache.spark.mllib.feature.PCA
      
        val data = sc.textFile("data/mllib/ridge-data/lpsa.data").map { line =>
          val parts = line.split(',')
          LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
        }.cache()
      
        val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
        val training = splits(0).cache()
        val test = splits(1)
      
        val pca = PCA.create(training.first().features.size/2, data.map(_.features))
        val training_pca = training.map(p => p.copy(features = pca.transform(p.features)))
        val test_pca = test.map(p => p.copy(features = pca.transform(p.features)))
      
        val numIterations = 100
        val model = LinearRegressionWithSGD.train(training, numIterations)
        val model_pca = LinearRegressionWithSGD.train(training_pca, numIterations)
      
        val valuesAndPreds = test.map { point =>
          val score = model.predict(point.features)
          (score, point.label)
        }
      
        val valuesAndPreds_pca = test_pca.map { point =>
          val score = model_pca.predict(point.features)
          (score, point.label)
        }
      
        val MSE = valuesAndPreds.map{case(v, p) => math.pow((v - p), 2)}.mean()
        val MSE_pca = valuesAndPreds_pca.map{case(v, p) => math.pow((v - p), 2)}.mean()
      
        println("Mean Squared Error = " + MSE)
        println("PCA Mean Squared Error = " + MSE_pca)
      ```
      
      Author: Kirill A. Korinskiy <catap@catap.ru>
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #4304 from catap/pca and squashes the following commits:
      
      501bcd9 [Joseph K. Bradley] Small updates: removed k from Java-friendly PCA fit().  In PCASuite, converted results to set for comparison. Added an error message for bad k in PCA.
      9dcc02b [Kirill A. Korinskiy] [SPARK-5521] fix scala style
      1892a06 [Kirill A. Korinskiy] [SPARK-5521] PCA wrapper for easy transform vectors
      8c07c75c
    • Joseph K. Bradley's avatar
      [SPARK-7431] [ML] [PYTHON] Made CrossValidatorModel call parent init in PySpark · 3038443e
      Joseph K. Bradley authored
      Fixes a bug with the PySpark cvModel not having a UID.
      Also made small PySpark fixes: Evaluator should inherit from Params, and MockModel should inherit from Model.
      
      CC: mengxr
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #5968 from jkbradley/pyspark-cv-uid and squashes the following commits:
      
      57f13cd [Joseph K. Bradley] Made CrossValidatorModel call parent init in PySpark
      3038443e