  1. May 12, 2015
    • zsxwing's avatar
      [HOTFIX] Use the old Job API to support old Hadoop versions · 247b7034
      zsxwing authored
      #5526 uses `Job.getInstance`, which does not exist in old Hadoop versions. Just use `new Job` to replace it.
      
      cc liancheng
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #6095 from zsxwing/hotfix and squashes the following commits:
      
      b0c2049 [zsxwing] Use the old Job API to support old Hadoop versions
      247b7034
    • Xiangrui Meng's avatar
      [SPARK-7572] [MLLIB] do not import Param/Params under pyspark.ml · 77f64c73
      Xiangrui Meng authored
      Remove `Param` and `Params` from `pyspark.ml` and add a section in the doc. brkyvz
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #6094 from mengxr/SPARK-7572 and squashes the following commits:
      
      022abd6 [Xiangrui Meng] do not import Param/Params under spark.ml
      77f64c73
    • Tathagata Das's avatar
      [SPARK-7554] [STREAMING] Throw exception when an active/stopped... · 23f7d66d
      Tathagata Das authored
      [SPARK-7554] [STREAMING] Throw exception when an active/stopped StreamingContext is used to create DStreams and output operations
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #6099 from tdas/SPARK-7554 and squashes the following commits:
      
      2cd4158 [Tathagata Das] Throw exceptions on attempts to add stuff to active and stopped contexts.
      23f7d66d
    • Xiangrui Meng's avatar
      [SPARK-7528] [MLLIB] make RankingMetrics Java-friendly · 2713bc65
      Xiangrui Meng authored
      `RankingMetrics` contains a ClassTag, which is hard to create in Java. This PR adds a factory method `of` for Java users. coderxiang
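      The difficulty is that a Scala constructor with a `ClassTag` context bound forces Java callers to materialize the tag by hand. A minimal sketch of the factory-method workaround (hypothetical `Metrics`/`of` names, not the actual RankingMetrics code):

      ```scala
      import scala.reflect.ClassTag

      // Hypothetical stand-in for a class that, like RankingMetrics, needs a ClassTag.
      class Metrics[T: ClassTag](val items: Seq[T]) {
        def toArray: Array[T] = items.toArray // this is what requires the ClassTag
      }

      object Metrics {
        // Java-friendly factory: the element type is fixed here, so Java callers
        // never have to construct a ClassTag themselves.
        def of(items: java.util.List[java.lang.Double]): Metrics[java.lang.Double] = {
          val b = Seq.newBuilder[java.lang.Double]
          val it = items.iterator()
          while (it.hasNext) b += it.next()
          new Metrics(b.result())
        }
      }
      ```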
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #6098 from mengxr/SPARK-7528 and squashes the following commits:
      
      e5d57ae [Xiangrui Meng] make RankingMetrics Java-friendly
      2713bc65
    • Tathagata Das's avatar
      [SPARK-7553] [STREAMING] Added methods to maintain a singleton StreamingContext · 00e7b09a
      Tathagata Das authored
      In a REPL/notebook environment, it's very easy to lose a reference to a StreamingContext by overwriting the variable name. So if you happen to execute the following commands
      ```
      val ssc = new StreamingContext(...) // cmd 1
      ssc.start() // cmd 2
      ...
      val ssc = new StreamingContext(...) // accidentally run cmd 1 again
      ```
      The value of ssc will be overwritten. Now you can neither start the new context (as only one context can be started) nor stop the previous context (as the reference is lost).
      Hence it's best to maintain a singleton reference to the active context, so that we never lose the reference to it.
      Since this problem mainly occurs in REPL environments, it's best to add this as Experimental support in the Scala API only, so that it can be used in Scala REPLs and notebooks.
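      The singleton pattern described above can be sketched generically (a toy `Context`, not Spark's actual StreamingContext code):

      ```scala
      import java.util.concurrent.atomic.AtomicReference

      // The companion tracks the active instance, so losing the variable
      // reference in a REPL does not strand a running context.
      class Context(val name: String) {
        def start(): Unit =
          if (!Context.active.compareAndSet(null, this))
            throw new IllegalStateException("another context is already active")
        def stop(): Unit = Context.active.compareAndSet(this, null)
      }

      object Context {
        val active = new AtomicReference[Context](null)
        def getActive: Option[Context] = Option(active.get())
        // Returns the active context if there is one; otherwise creates and starts one.
        def getActiveOrCreate(create: => Context): Context =
          getActive.getOrElse { val c = create; c.start(); c }
      }
      ```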
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #6070 from tdas/SPARK-7553 and squashes the following commits:
      
      731c9a1 [Tathagata Das] Fixed style
      a797171 [Tathagata Das] Added more unit tests
      19fc70b [Tathagata Das] Added :: Experimental :: in docs
      64706c9 [Tathagata Das] Fixed test
      634db5d [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into SPARK-7553
      3884a25 [Tathagata Das] Fixing test bug
      d37a846 [Tathagata Das] Added getActive and getActiveOrCreate
      00e7b09a
    • Joseph K. Bradley's avatar
      [SPARK-7573] [ML] OneVsRest cleanups · 96c4846d
      Joseph K. Bradley authored
      Minor cleanups discussed with [~mengxr]:
      * move OneVsRest from reduction to classification sub-package
      * make model constructor private
      
      Some doc cleanups too
      
      CC: harsha2010  Could you please verify this looks OK?  Thanks!
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #6097 from jkbradley/onevsrest-cleanup and squashes the following commits:
      
      4ecd48d [Joseph K. Bradley] org imports
      430b065 [Joseph K. Bradley] moved OneVsRest from reduction subpackage to classification.  small java doc style fixes
      9f8b9b9 [Joseph K. Bradley] Small cleanups to OneVsRest.  Made model constructor private to ml package.
      96c4846d
    • Joseph K. Bradley's avatar
      [SPARK-7557] [ML] [DOC] User guide for spark.ml HashingTF, Tokenizer · f0c1bc34
      Joseph K. Bradley authored
      Added feature transformer subsection to spark.ml guide, with HashingTF and Tokenizer.  Added JavaHashingTFSuite to test Java examples in new guide.
      
      I've run Scala, Python examples in the Spark/PySpark shells.  I ran the Java examples via the test suite (with small modifications for printing).
      
      CC: mengxr
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #6093 from jkbradley/hashingtf-guide and squashes the following commits:
      
      d5d213f [Joseph K. Bradley] small fix
      dd6e91a [Joseph K. Bradley] fixes from code review of user guide
      33c3ff9 [Joseph K. Bradley] small fix
      bc6058c [Joseph K. Bradley] fix link
      361a174 [Joseph K. Bradley] Added subsection for feature transformers to spark.ml guide, with HashingTF and Tokenizer.  Added JavaHashingTFSuite to test Java examples in new guide
      f0c1bc34
    • Yuhao Yang's avatar
      [SPARK-7496] [MLLIB] Update Programming guide with Online LDA · 1d703660
      Yuhao Yang authored
      jira: https://issues.apache.org/jira/browse/SPARK-7496
      
      Update LDA subsection of clustering section of MLlib programming guide to include OnlineLDA.
      
      Author: Yuhao Yang <hhbyyh@gmail.com>
      
      Closes #6046 from hhbyyh/ldaDocument and squashes the following commits:
      
      4b6fbfa [Yuhao Yang] add online paper and some comparison
      fd4c983 [Yuhao Yang] update lda document for optimizers
      1d703660
    • zsxwing's avatar
      [SPARK-7406] [STREAMING] [WEBUI] Add tooltips for "Scheduling Delay",... · 1422e79e
      zsxwing authored
      [SPARK-7406] [STREAMING] [WEBUI] Add tooltips for "Scheduling Delay", "Processing Time" and "Total Delay"
      
      Screenshots:
      ![screen shot 2015-05-06 at 2 29 03 pm](https://cloud.githubusercontent.com/assets/1000778/7504129/9c57f710-f3fc-11e4-9c6e-1b79c17c546d.png)
      
      ![screen shot 2015-05-06 at 2 24 35 pm](https://cloud.githubusercontent.com/assets/1000778/7504140/b63bb216-f3fc-11e4-83a5-6dfc6481d192.png)
      
      tdas as we discussed offline
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #5952 from zsxwing/SPARK-7406 and squashes the following commits:
      
      2b004ea [zsxwing] Merge branch 'master' into SPARK-7406
      e9eb506 [zsxwing] Update tooltip contents
      2215b2a [zsxwing] Add tooltips for "Scheduling Delay", "Processing Time" and "Total Delay"
      1422e79e
    • Xiangrui Meng's avatar
      [SPARK-7571] [MLLIB] rename Math to math · a4874b0d
      Xiangrui Meng authored
      `scala.Math` is deprecated since 2.8. This PR only touches `Math` usages in MLlib. dbtsai
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #6092 from mengxr/SPARK-7571 and squashes the following commits:
      
      fe8f8d3 [Xiangrui Meng] Math -> math
      a4874b0d
    • Venkata Ramana Gollamudi's avatar
      [SPARK-7484][SQL]Support jdbc connection properties · 455551d1
      Venkata Ramana Gollamudi authored
      A few JDBC drivers, such as SybaseIQ, support passing the username and password only through connection properties, so the same needs to be supported for
      SQLContext.jdbc, dataframe.createJDBCTable and dataframe.insertIntoJDBC.
      Added as default arguments or overloaded functions to maintain backward compatibility.
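      A small sketch of the idea using `java.util.Properties` (the URL and the commented call shape are illustrative, not the exact API added by this PR):

      ```scala
      import java.util.Properties

      // Credentials passed via connection properties rather than the URL,
      // as drivers like SybaseIQ require.
      val connProps = new Properties()
      connProps.setProperty("user", "spark")
      connProps.setProperty("password", "secret")

      // Illustrative call shape only:
      // sqlContext.jdbc("jdbc:sybase:Tds:host:2638/db", "mytable", connProps)
      ```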
      
      Author: Venkata Ramana Gollamudi <ramana.gollamudi@huawei.com>
      
      Closes #6009 from gvramana/add_jdbc_conn_properties and squashes the following commits:
      
      396a0d0 [Venkata Ramana Gollamudi] fixed comments
      d66dd8c [Venkata Ramana Gollamudi] fixed comments
      1b8cd8c [Venkata Ramana Gollamudi] Support jdbc connection properties
      455551d1
    • Xiangrui Meng's avatar
      [SPARK-7559] [MLLIB] Bucketizer should include the right most boundary in the last bucket. · 23b9863e
      Xiangrui Meng authored
      `Bucketizer` currently gives +inf special treatment. This could be simplified by always including the largest split value in the last bucket: e.g., (x1, x2, x3) defines buckets [x1, x2) and [x2, x3]. This shouldn't affect user code much, and some applications need to include the right-most value. For example, we can bucketize ratings from 0 to 10 into bad, neutral, and good with splits 0, 4, 6, 10. It would read oddly if users had to put 0, 4, 6, 10.1 (or 11).
      
      This also updates the implementation to use `Arrays.binarySearch`, and `withClue` in tests.
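      A toy version of the lookup (not the actual Bucketizer code), showing both `Arrays.binarySearch` and the inclusive right-most boundary:

      ```scala
      import java.util.Arrays

      // Splits 0, 4, 6, 10 define buckets [0,4), [4,6), [6,10].
      val splits = Array(0.0, 4.0, 6.0, 10.0)

      def bucketOf(x: Double): Int = {
        require(x >= splits.head && x <= splits.last, s"$x out of range")
        if (x == splits.last) splits.length - 2 // right-most value goes to the last bucket
        else {
          val i = Arrays.binarySearch(splits, x)
          // binarySearch returns the index if found, else -(insertionPoint) - 1;
          // map either case to the index of the preceding split.
          if (i >= 0) i else -i - 2
        }
      }
      ```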
      
      yinxusen jkbradley
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #6075 from mengxr/SPARK-7559 and squashes the following commits:
      
      e28f910 [Xiangrui Meng] update bucketizer impl
      23b9863e
    • Michael Armbrust's avatar
      [SPARK-7569][SQL] Better error for invalid binary expressions · 2a41c0d7
      Michael Armbrust authored
      `scala> Seq((1,1)).toDF("a", "b").select(lit(1) + new java.sql.Date(1)) `
      
      Before:
      
      ```
      org.apache.spark.sql.AnalysisException: invalid expression (1 + 0) between Literal 1, IntegerType and Literal 0, DateType;
      ```
      
      After:
      ```
      org.apache.spark.sql.AnalysisException: invalid expression (1 + 0) between int and date;
      ```
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #6089 from marmbrus/betterBinaryError and squashes the following commits:
      
      23b68ad [Michael Armbrust] [SPARK-7569][SQL] Better error for invalid binary expressions
      2a41c0d7
    • Ram Sriharsha's avatar
      [SPARK-7015] [MLLIB] [WIP] Multiclass to Binary Reduction: One Against All · 595a6758
      Ram Sriharsha authored
      Initial cut of one-against-all. The test code is scaffolding, not fully implemented.
      This WIP is to gather early feedback.
      
      Author: Ram Sriharsha <rsriharsha@hw11853.local>
      
      Closes #5830 from harsha2010/reduction and squashes the following commits:
      
      5f4b495 [Ram Sriharsha] Fix Test
      386e98b [Ram Sriharsha] Style fix
      49b4a17 [Ram Sriharsha] Simplify the test
      02279cc [Ram Sriharsha] Output Label Metadata in Prediction Col
      bc78032 [Ram Sriharsha] Code Review Updates
      8ce4845 [Ram Sriharsha] Merge with Master
      2a807be [Ram Sriharsha] Merge branch 'master' into reduction
      e21bfcc [Ram Sriharsha] Style Fix
      5614f23 [Ram Sriharsha] Style Fix
      c75583a [Ram Sriharsha] Cleanup
      7a5f136 [Ram Sriharsha] Fix TODOs
      804826b [Ram Sriharsha] Merge with Master
      1448a5f [Ram Sriharsha] Style Fix
      6e47807 [Ram Sriharsha] Style Fix
      d63e46b [Ram Sriharsha] Incorporate Code Review Feedback
      ced68b5 [Ram Sriharsha] Refactor OneVsAll to implement Predictor
      78fa82a [Ram Sriharsha] extra line
      0dfa1fb [Ram Sriharsha] Fix inexhaustive match cases that may arise from UnresolvedAttribute
      a59a4f4 [Ram Sriharsha] @Experimental
      4167234 [Ram Sriharsha] Merge branch 'master' into reduction
      868a4fd [Ram Sriharsha] @Experimental
      041d905 [Ram Sriharsha] Code Review Fixes
      df188d8 [Ram Sriharsha] Style fix
      612ec48 [Ram Sriharsha] Style Fix
      6ef43d3 [Ram Sriharsha] Prefer Unresolved Attribute to Option: Java APIs are cleaner
      6bf6bff [Ram Sriharsha] Update OneHotEncoder to new API
      e29cb89 [Ram Sriharsha] Merge branch 'master' into reduction
      1c7fa44 [Ram Sriharsha] Fix Tests
      ca83672 [Ram Sriharsha] Incorporate Code Review Feedback + Rename to OneVsRestClassifier
      221beeed [Ram Sriharsha] Upgrade to use Copy method for cloning Base Classifiers
      26f1ddb [Ram Sriharsha] Merge with SPARK-5956 API changes
      9738744 [Ram Sriharsha] Merge branch 'master' into reduction
      1a3e375 [Ram Sriharsha] More efficient Implementation: Use withColumn to generate label column dynamically
      32e0189 [Ram Sriharsha] Restrict reduction to Margin Based Classifiers
      ff272da [Ram Sriharsha] Style fix
      28771f5 [Ram Sriharsha] Add Tests for Multiclass to Binary Reduction
      b60f874 [Ram Sriharsha] Fix Style issues in Test
      3191cdf [Ram Sriharsha] Remove this test, accidental commit
      23f056c [Ram Sriharsha] Fix Headers for test
      1b5e929 [Ram Sriharsha] Fix Style issues and add Header
      8752863 [Ram Sriharsha] [SPARK-7015][MLLib][WIP] Multiclass to Binary Reduction: One Against All
      595a6758
    • Tim Ellison's avatar
      [SPARK-2018] [CORE] Upgrade LZF library to fix endian serialization problem · 5438f49c
      Tim Ellison authored
      
      Pick up a newer version of the dependency with the fix for SPARK-2018. The update involved patching the ning/compress LZF library to handle big-endian systems correctly.
      
      Credit goes to gireeshpunathil for diagnosing the problem, and cowtowncoder for fixing it.
      
      Spark tests run clean for me.
      
      Author: Tim Ellison <t.p.ellison@gmail.com>
      
      Closes #6077 from tellison/UpgradeLZF and squashes the following commits:
      
      ad8d4ef [Tim Ellison] [SPARK-2018] [CORE] Upgrade LZF library to fix endian serialization problem
      5438f49c
    • Burak Yavuz's avatar
      [SPARK-7487] [ML] Feature Parity in PySpark for ml.regression · 8e935b0a
      Burak Yavuz authored
      Added LinearRegression Python API
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #6016 from brkyvz/ml-reg and squashes the following commits:
      
      11c9ef9 [Burak Yavuz] address comments
      1027a40 [Burak Yavuz] fix typo
      4c699ad [Burak Yavuz] added tree regressor api
      8afead2 [Burak Yavuz] made mixin for DT
      fa51c74 [Burak Yavuz] save additions
      0640d48 [Burak Yavuz] added ml.regression
      82aac48 [Burak Yavuz] added linear regression
      8e935b0a
    • Andrew Or's avatar
      b9b01f44
    • Wenchen Fan's avatar
      [SPARK-7276] [DATAFRAME] speed up DataFrame.select by collapsing Project · 4e290522
      Wenchen Fan authored
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      Closes #5831 from cloud-fan/7276 and squashes the following commits:
      
      ee4a1e1 [Wenchen Fan] fix rebase mistake
      a3b565d [Wenchen Fan] refactor
      99deb5d [Wenchen Fan] add test
      f1f67ad [Wenchen Fan] fix 7276
      4e290522
    • Andrew Or's avatar
      [SPARK-7500] DAG visualization: move cluster labeling to dagre-d3 · 65697bbe
      Andrew Or authored
      This fixes the label bleeding issue described in the JIRA and pictured in the screenshots below. I also took the opportunity to move some code to the places that they belong more to. In particular:
      
      (1) Drawing cluster labels is now implemented in my branch of dagre-d3 instead of in Spark
      (2) All graph styling is now moved from Scala to JS
      
      Note that these changes are related because our existing mechanism of "tacking on cluster labels" afterwards isn't flexible enough for us to fix issues like this one easily. For the other half of the changes, visit http://github.com/andrewor14/dagre-d3.
      
      -------------------
      
      **Before.**
      <img src="https://cloud.githubusercontent.com/assets/2133137/7582769/b1423440-f845-11e4-8248-b3446a01bf79.png" width="300px"/>
      
      -------------------
      
      **After.**
      <img src="https://cloud.githubusercontent.com/assets/2133137/7582742/74891ae6-f845-11e4-96c4-41c7b8aedbdf.png" width="400px"/>
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #6076 from andrewor14/dag-viz-bleed and squashes the following commits:
      
      5858d7a [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-bleed
      c686dc4 [Andrew Or] Fix tooltip placement
      d908c36 [Andrew Or] Add link to dagre-d3 changes (minor)
      4a4fb58 [Andrew Or] Fix bleeding + move all styling to JS
      65697bbe
    • Wenchen Fan's avatar
      [DataFrame][minor] support column in field accessor · bfcaf8ad
      Wenchen Fan authored
      Minor improvement: now we can use a `Column` as an extraction expression.
      
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      Closes #6080 from cloud-fan/tmp and squashes the following commits:
      
      0fdefb7 [Wenchen Fan] support column in field accessor
      bfcaf8ad
    • Cheng Lian's avatar
      [SPARK-3928] [SPARK-5182] [SQL] Partitioning support for the data sources API · 0595b6de
      Cheng Lian authored
      This PR adds partitioning support for the external data sources API. It aims to simplify development of file system based data sources, and provide first class partitioning support for both read path and write path.  Existing data sources like JSON and Parquet can be simplified with this work.
      
      ## New features provided
      
      1. Hive compatible partition discovery
      
         This actually generalizes the partition discovery strategy used in Parquet data source in Spark 1.3.0.
      
      1. Generalized partition pruning optimization
      
         Now partition pruning is handled during physical planning phase.  Specific data sources don't need to worry about this harness anymore.
      
         (This also implies that we can remove `CatalystScan` after migrating the Parquet data source, since now we don't need to pass Catalyst expressions to data source implementations.)
      
      1. Insertion with dynamic partitions
      
         When inserting data to a `FSBasedRelation`, data can be partitioned dynamically by specified partition columns.
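      The Hive-compatible discovery in feature 1 can be sketched as a simple path parser (a simplification: no type inference or value escaping, which the real implementation handles):

      ```scala
      // Hive-style layout encodes partition column values in directory names,
      // e.g. /table/year=2015/month=05/part-00000.
      def partitionValues(path: String): Map[String, String] =
        path.split('/')
          .filter(_.contains('='))
          .map { seg =>
            val idx = seg.indexOf('=')
            seg.substring(0, idx) -> seg.substring(idx + 1)
          }
          .toMap
      ```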
      
      ## New structures provided
      
      ### Developer API
      
      1. `FSBasedRelation`
      
         Base abstract class for file system based data sources.
      
      1. `OutputWriter`
      
         Base abstract class for output row writers, responsible for writing a single row object.
      
      1. `FSBasedRelationProvider`
      
         A new relation provider for `FSBasedRelation` subclasses. Note that data sources extending `FSBasedRelation` don't need to extend `RelationProvider` and `SchemaRelationProvider`.
      
      ### User API
      
      New overloaded versions of
      
      1. `DataFrame.save()`
      1. `DataFrame.saveAsTable()`
      1. `SQLContext.load()`
      
      are provided to allow users to save/load DataFrames with user defined dynamic partition columns.
      
      ### Spark SQL query planning
      
      1. `InsertIntoFSBasedRelation`
      
         Used to implement write path for `FSBasedRelation`s.
      
      1. New rules for `FSBasedRelation` in `DataSourceStrategy`
      
         These are added to hook `FSBasedRelation` into physical query plan in read path, and perform partition pruning.
      
      ## TODO
      
      - [ ] Use scratch directories when overwriting a table with data selected from itself.
      
      Currently, this is not supported, because the table being overwritten is always deleted before any data is written to it.
      
      - [ ] When inserting with dynamic partition columns, use external sorter to group the data first.
      
      This ensures that we only need to open a single `OutputWriter` at a time.  For data sources like Parquet, `OutputWriter`s can be quite memory-consuming.  One issue is that this approach breaks the row distribution of the original DataFrame.  However, we didn't promise to preserve data distribution when writing a DataFrame.
      
      - [x] More tests.  Specifically, test cases for
      
            - [x] Self-join
            - [x] Loading partitioned relations with a subset of partition columns stored in data files.
            - [x] `SQLContext.load()` with user defined dynamic partition columns.
      
      ## Parquet data source migration
      
      Parquet data source migration is covered in PR https://github.com/liancheng/spark/pull/6, which is against this PR branch and for preview only. A formal PR needs to be made after this one is merged.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #5526 from liancheng/partitioning-support and squashes the following commits:
      
      5351a1b [Cheng Lian] Fixes compilation error introduced while rebasing
      1f9b1a5 [Cheng Lian] Tweaks data schema passed to FSBasedRelations
      43ba50e [Cheng Lian] Avoids serializing generated projection code
      edf49e7 [Cheng Lian] Removed commented stale code block
      348a922 [Cheng Lian] Adds projection in FSBasedRelation.buildScan(requiredColumns, inputPaths)
      ad4d4de [Cheng Lian] Enables HDFS style globbing
      8d12e69 [Cheng Lian] Fixes compilation error
      c71ac6c [Cheng Lian] Addresses comments from @marmbrus
      7552168 [Cheng Lian] Fixes typo in MimaExclude.scala
      0349e09 [Cheng Lian] Fixes compilation error introduced while rebasing
      52b0c9b [Cheng Lian] Adjusts project/MimaExclude.scala
      c466de6 [Cheng Lian] Addresses comments
      bc3f9b4 [Cheng Lian] Uses projection to separate partition columns and data columns while inserting rows
      795920a [Cheng Lian] Fixes compilation error after rebasing
      0b8cd70 [Cheng Lian] Adds Scala/Catalyst row conversion when writing non-partitioned tables
      fa543f3 [Cheng Lian] Addresses comments
      5849dd0 [Cheng Lian] Fixes doc typos.  Fixes partition discovery refresh.
      51be443 [Cheng Lian] Replaces FSBasedRelation.outputCommitterClass with FSBasedRelation.prepareForWrite
      c4ed4fe [Cheng Lian] Bug fixes and a new test suite
      a29e663 [Cheng Lian] Bug fix: should only pass actuall data files to FSBaseRelation.buildScan
      5f423d3 [Cheng Lian] Bug fixes. Lets data source to customize OutputCommitter rather than OutputFormat
      54c3d7b [Cheng Lian] Enforces that FileOutputFormat must be used
      be0c268 [Cheng Lian] Uses TaskAttempContext rather than Configuration in OutputWriter.init
      0bc6ad1 [Cheng Lian] Resorts to new Hadoop API, and now FSBasedRelation can customize output format class
      f320766 [Cheng Lian] Adds prepareForWrite() hook, refactored writer containers
      422ff4a [Cheng Lian] Fixes style issue
      ce52353 [Cheng Lian] Adds new SQLContext.load() overload with user defined dynamic partition columns
      8d2ff71 [Cheng Lian] Merges partition columns when reading partitioned relations
      ca1805b [Cheng Lian] Removes duplicated partition discovery code in new Parquet
      f18dec2 [Cheng Lian] More strict schema checking
      b746ab5 [Cheng Lian] More tests
      9b487bf [Cheng Lian] Fixes compilation errors introduced while rebasing
      ea6c8dd [Cheng Lian] Removes remote debugging stuff
      327bb1d [Cheng Lian] Implements partitioning support for data sources API
      3c5073a [Cheng Lian] Fixes SaveModes used in test cases
      fb5a607 [Cheng Lian] Fixes compilation error
      9d17607 [Cheng Lian] Adds the contract that OutputWriter should have zero-arg constructor
      5de194a [Cheng Lian] Forgot Apache licence header
      95d0b4d [Cheng Lian] Renames PartitionedSchemaRelationProvider to FSBasedRelationProvider
      770b5ba [Cheng Lian] Adds tests for FSBasedRelation
      3ba9bbf [Cheng Lian] Adds DataFrame.saveAsTable() overrides which support partitioning
      1b8231f [Cheng Lian] Renames FSBasedPrunedFilteredScan to FSBasedRelation
      aa8ba9a [Cheng Lian] Javadoc fix
      012ed2d [Cheng Lian] Adds PartitioningOptions
      7dd8dd5 [Cheng Lian] Adds new interfaces and stub methods for data sources API partitioning support
      0595b6de
    • Wenchen Fan's avatar
      [DataFrame][minor] cleanup unapply methods in DataTypes · 831504cf
      Wenchen Fan authored
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      Closes #6079 from cloud-fan/unapply and squashes the following commits:
      
      40da442 [Wenchen Fan] one more
      7d90a05 [Wenchen Fan] cleanup unapply in DataTypes
      831504cf
    • Daoyuan Wang's avatar
      [SPARK-6876] [PySpark] [SQL] add DataFrame na.replace in pyspark · d86ce845
      Daoyuan Wang authored
      Author: Daoyuan Wang <daoyuan.wang@intel.com>
      
      Closes #6003 from adrian-wang/pynareplace and squashes the following commits:
      
      672efba [Daoyuan Wang] remove py2.7 feature
      4a148f7 [Daoyuan Wang] to_replace support dict, value support single value, and add full tests
      9e232e7 [Daoyuan Wang] rename scala map
      af0268a [Daoyuan Wang] remove na
      63ac579 [Daoyuan Wang] add na.replace in pyspark
      d86ce845
    • Tathagata Das's avatar
      [SPARK-7532] [STREAMING] StreamingContext.start() made to logWarning and not throw exception · ec6f2a97
      Tathagata Das authored
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #6060 from tdas/SPARK-7532 and squashes the following commits:
      
      6fe2e83 [Tathagata Das] Update docs
      7dadfc3 [Tathagata Das] Fixed bug again
      99c7678 [Tathagata Das] Added logInfo
      65aec20 [Tathagata Das] Fix bug
      5bf031b [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into SPARK-7532
      1a9a818 [Tathagata Das] Fix scaladoc
      c584313 [Tathagata Das] StreamingContext.start() made to logWarning and not throw exception
      ec6f2a97
    • Andrew Or's avatar
      [SPARK-7467] Dag visualization: treat checkpoint as an RDD operation · f3e8e600
      Andrew Or authored
      Such that a checkpoint RDD does not go into random scopes on the UI, e.g. `take`. We've seen this in streaming.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #6004 from andrewor14/dag-viz-checkpoint and squashes the following commits:
      
      9217439 [Andrew Or] Fix checkpoints
      4ae8806 [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-checkpoint
      19bc07b [Andrew Or] Treat checkpoint as an RDD operation
      f3e8e600
    • Marcelo Vanzin's avatar
      [SPARK-7485] [BUILD] Remove pyspark files from assembly. · 82e890fb
      Marcelo Vanzin authored
      The sbt part of the build is hacky; it basically tricks sbt
      into generating the zip by using a generator, but returns
      an empty list for the generated files so that nothing is
      actually added to the assembly.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #6022 from vanzin/SPARK-7485 and squashes the following commits:
      
      22c1e04 [Marcelo Vanzin] Remove unneeded code.
      4893622 [Marcelo Vanzin] [SPARK-7485] [build] Remove pyspark files from assembly.
      82e890fb
    • linweizhong's avatar
      [MINOR] [PYSPARK] Set PYTHONPATH to python/lib/pyspark.zip rather than python/pyspark · 98478752
      linweizhong authored
      As of PR #5580 we create pyspark.zip during the build and set PYTHONPATH to python/lib/pyspark.zip, so this is updated to keep things consistent.
      
      Author: linweizhong <linweizhong@huawei.com>
      
      Closes #6047 from Sephiroth-Lin/pyspark_pythonpath and squashes the following commits:
      
      8cc3d96 [linweizhong] Set PYTHONPATH to python/lib/pyspark.zip rather than python/pyspark as PR#5580 we have create pyspark.zip on build
      98478752
    • zsxwing's avatar
      [SPARK-7534] [CORE] [WEBUI] Fix the Stage table when a stage is missing · 8a4edecc
      zsxwing authored
      Just improved the Stage table when a stage is missing.
      
      Before:
      
      ![screen shot 2015-05-11 at 10 11 51 am](https://cloud.githubusercontent.com/assets/1000778/7570842/2ba37380-f7c8-11e4-9b5f-cf1a6264b2a4.png)
      
      After:
      
      ![screen shot 2015-05-11 at 10 26 09 am](https://cloud.githubusercontent.com/assets/1000778/7570848/33703152-f7c8-11e4-81a8-d53dd72d7b8d.png)
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #6061 from zsxwing/SPARK-7534 and squashes the following commits:
      
      09fe862 [zsxwing] Leave it blank rather than '-'
      6299197 [zsxwing] Fix the Stage table when a stage is missing
      8a4edecc
    • vidmantas zemleris's avatar
      [SPARK-6994][SQL] Update docs for fetching Row fields by name · 640f63b9
      vidmantas zemleris authored
      add docs for https://issues.apache.org/jira/browse/SPARK-6994
      
      Author: vidmantas zemleris <vidmantas@vinted.com>
      
      Closes #6030 from vidma/docs/row-with-named-fields and squashes the following commits:
      
      241b401 [vidmantas zemleris] [SPARK-6994][SQL] Update docs for fetching Row fields by name
      640f63b9
    • Reynold Xin's avatar
      [SQL] Rename Dialect -> ParserDialect. · 16696759
      Reynold Xin authored
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #6071 from rxin/parserdialect and squashes the following commits:
      
      ca2eb31 [Reynold Xin] Rename Dialect -> ParserDialect.
      16696759
  2. May 11, 2015
    • Joshi's avatar
      [SPARK-7435] [SPARKR] Make DataFrame.show() consistent with that of Scala and pySpark · b94a9337
      Joshi authored
      Author: Joshi <rekhajoshm@gmail.com>
      Author: Rekha Joshi <rekhajoshm@gmail.com>
      
      Closes #5989 from rekhajoshm/fix/SPARK-7435 and squashes the following commits:
      
      cfc9e02 [Joshi] Spark-7435[R]: updated patch for review comments
      62becc1 [Joshi] SPARK-7435: Update to DataFrame
      e3677c9 [Rekha Joshi] Merge pull request #1 from apache/master
      b94a9337
    • Reynold Xin's avatar
      [SPARK-7509][SQL] DataFrame.drop in Python for dropping columns. · 028ad4bd
      Reynold Xin authored
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #6068 from rxin/drop-column and squashes the following commits:
      
      9d7d5ec [Reynold Xin] [SPARK-7509][SQL] DataFrame.drop in Python for dropping columns.
      028ad4bd
    • Zhongshuai Pei's avatar
      [SPARK-7437] [SQL] Fold "literal in (item1, item2, ..., literal, ...)" into true or false directly · 4b5e1fe9
      Zhongshuai Pei authored
      SQL
      ```
      select key from src where 3 in (4, 5);
      ```
      Before
      ```
      == Optimized Logical Plan ==
      Project [key#12]
       Filter 3 INSET (5,4)
        MetastoreRelation default, src, None
      ```
      
      After
      ```
      == Optimized Logical Plan ==
      LocalRelation [key#228], []
      ```
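      A toy model of the fold (hypothetical class names, not Catalyst's actual implementation): when the tested value and every list item are literals, the predicate is evaluated once at optimization time and replaced by a boolean literal:

      ```scala
      // Minimal expression model.
      sealed trait Expr
      case class Literal(v: Int) extends Expr
      case class In(value: Expr, list: Seq[Expr]) extends Expr
      case class BooleanLiteral(b: Boolean) extends Expr

      // Fold "literal IN (literal, ...)" into true or false directly.
      def foldIn(e: Expr): Expr = e match {
        case In(Literal(v), items) if items.forall(_.isInstanceOf[Literal]) =>
          BooleanLiteral(items.exists {
            case Literal(x) => x == v
            case _          => false
          })
        case other => other
      }
      ```

      With the folded predicate being a constant `false`, the filter and scan collapse to an empty `LocalRelation`, as shown in the optimized plan above.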
      
      Author: Zhongshuai Pei <799203320@qq.com>
      Author: DoingDone9 <799203320@qq.com>
      
      Closes #5972 from DoingDone9/InToFalse and squashes the following commits:
      
      4c722a2 [Zhongshuai Pei] Update predicates.scala
      abe2bbb [Zhongshuai Pei] Update Optimizer.scala
      fa461a5 [Zhongshuai Pei] Update Optimizer.scala
      e34c28a [Zhongshuai Pei] Update predicates.scala
      24739bd [Zhongshuai Pei] Update ConstantFoldingSuite.scala
      f4dbf50 [Zhongshuai Pei] Update ConstantFoldingSuite.scala
      35ceb7a [Zhongshuai Pei] Update Optimizer.scala
      36c194e [Zhongshuai Pei] Update Optimizer.scala
      2e8f6ca [Zhongshuai Pei] Update Optimizer.scala
      14952e2 [Zhongshuai Pei] Merge pull request #13 from apache/master
      f03fe7f [Zhongshuai Pei] Merge pull request #12 from apache/master
      f12fa50 [Zhongshuai Pei] Merge pull request #10 from apache/master
      f61210c [Zhongshuai Pei] Merge pull request #9 from apache/master
      34b1a9a [Zhongshuai Pei] Merge pull request #8 from apache/master
      802261c [DoingDone9] Merge pull request #7 from apache/master
      d00303b [DoingDone9] Merge pull request #6 from apache/master
      98b134f [DoingDone9] Merge pull request #5 from apache/master
      161cae3 [DoingDone9] Merge pull request #4 from apache/master
      c87e8b6 [DoingDone9] Merge pull request #3 from apache/master
      cb1852d [DoingDone9] Merge pull request #2 from apache/master
      c3f046f [DoingDone9] Merge pull request #1 from apache/master
      4b5e1fe9
    • Cheng Hao's avatar
      [SPARK-7411] [SQL] Support SerDe for HiveQl in CTAS · e35d878b
      Cheng Hao authored
This is a follow-up of #5876 and should be merged after it.
      
      Let's wait for unit testing result from Jenkins.
      
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #5963 from chenghao-intel/useIsolatedClient and squashes the following commits:
      
      f87ace6 [Cheng Hao] remove the TODO and add `resolved condition` for HiveTable
      a8260e8 [Cheng Hao] Update code as feedback
      f4e243f [Cheng Hao] remove the serde setting for SequenceFile
      d166afa [Cheng Hao] style issue
      d25a4aa [Cheng Hao] Add SerDe support for CTAS
      e35d878b
    • Reynold Xin's avatar
      [SPARK-7324] [SQL] DataFrame.dropDuplicates · b6bf4f76
      Reynold Xin authored
      This should also close https://github.com/apache/spark/pull/5870
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #6066 from rxin/dropDups and squashes the following commits:
      
      130692f [Reynold Xin] [SPARK-7324][SQL] DataFrame.dropDuplicates
      b6bf4f76
    • Tathagata Das's avatar
      [SPARK-7530] [STREAMING] Added StreamingContext.getState() to expose the... · f9c7580a
      Tathagata Das authored
      [SPARK-7530] [STREAMING] Added StreamingContext.getState() to expose the current state of the context
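The lifecycle exposed by `getState()` can be sketched as a small state machine. This is an illustrative Python toy, not Spark's implementation (per the squashed commits, the real `StreamingContextState` is a Java enum, with `STARTED` renamed to `ACTIVE`); the class name `MiniStreamingContext` is made up for the example.

```python
from enum import Enum

class StreamingContextState(Enum):
    # States named in the commit log (illustrative only).
    INITIALIZED = 0
    ACTIVE = 1
    STOPPED = 2

class MiniStreamingContext:
    """Toy context exposing get_state(), loosely modelled on the PR."""

    def __init__(self):
        self._state = StreamingContextState.INITIALIZED

    def get_state(self):
        # The point of the PR: callers can inspect the current state.
        return self._state

    def start(self):
        if self._state is not StreamingContextState.INITIALIZED:
            raise RuntimeError("context already started or stopped")
        self._state = StreamingContextState.ACTIVE

    def stop(self):
        self._state = StreamingContextState.STOPPED

ctx = MiniStreamingContext()
print(ctx.get_state())  # StreamingContextState.INITIALIZED
ctx.start()
print(ctx.get_state())  # StreamingContextState.ACTIVE
```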
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #6058 from tdas/SPARK-7530 and squashes the following commits:
      
      80ee0e6 [Tathagata Das] STARTED --> ACTIVE
      3da6547 [Tathagata Das] Added synchronized
      dd88444 [Tathagata Das] Added more docs
      e1a8505 [Tathagata Das] Fixed comment length
      89f9980 [Tathagata Das] Change to Java enum and added Java test
      7c57351 [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into SPARK-7530
      dd4e702 [Tathagata Das] Addressed comments.
      3d56106 [Tathagata Das] Added Mima excludes
      2b86ba1 [Tathagata Das] Added scala docs.
      1722433 [Tathagata Das] Fixed style
      976b094 [Tathagata Das] Added license
      0585130 [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into SPARK-7530
      e0f0a05 [Tathagata Das] Added getState and exposed StreamingContextState
      f9c7580a
    • Xusen Yin's avatar
      [SPARK-5893] [ML] Add bucketizer · 35fb42a0
      Xusen Yin authored
      JIRA issue [here](https://issues.apache.org/jira/browse/SPARK-5893).
      
One thing to make clear: the `buckets` parameter, an array of `Double`, acts as split points. Say,
      
      ```scala
      buckets = Array(-0.5, 0.0, 0.5)
      ```
      
splits the real number line into four ranges, (-inf, -0.5], (-0.5, 0.0], (0.0, 0.5], (0.5, +inf), which are encoded as 0, 1, 2, 3.
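The mapping described above can be sketched with a binary search; this is an illustrative Python version of the semantics stated in this commit message (right-closed intervals), not the shipped Scala `Bucketizer`:

```python
from bisect import bisect_left

def bucketize(x, splits):
    """Map x to a bucket index given sorted split points.

    With splits [-0.5, 0.0, 0.5] the buckets are, as described above,
    (-inf, -0.5], (-0.5, 0.0], (0.0, 0.5], (0.5, +inf) -> 0, 1, 2, 3.
    bisect_left gives the right-closed-interval behaviour directly: a
    value equal to a split point lands in the bucket ending at that split.
    """
    return bisect_left(splits, x)

splits = [-0.5, 0.0, 0.5]
print([bucketize(v, splits) for v in (-1.0, -0.5, -0.1, 0.0, 0.4, 0.5, 2.0)])
# [0, 0, 1, 1, 2, 2, 3]
```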
      
      Author: Xusen Yin <yinxusen@gmail.com>
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #5980 from yinxusen/SPARK-5893 and squashes the following commits:
      
      dc8c843 [Xusen Yin] Merge pull request #4 from jkbradley/yinxusen-SPARK-5893
      1ca973a [Joseph K. Bradley] one more bucketizer test
      34f124a [Joseph K. Bradley] Removed lowerInclusive, upperInclusive params from Bucketizer, and used splits instead.
      eacfcfa [Xusen Yin] change ML attribute from splits into buckets
      c3cc770 [Xusen Yin] add more unit test for binary search
      3a16cc2 [Xusen Yin] refine comments and names
      ac77859 [Xusen Yin] fix style error
      fb30d79 [Xusen Yin] fix and test binary search
      2466322 [Xusen Yin] refactor Bucketizer
      11fb00a [Xusen Yin] change it into an Estimator
      998bc87 [Xusen Yin] check buckets
      4024cf1 [Xusen Yin] add test suite
      5fe190e [Xusen Yin] add bucketizer
      35fb42a0
    • Reynold Xin's avatar
      Updated DataFrame.saveAsTable Hive warning to include SPARK-7550 ticket. · 87229c95
      Reynold Xin authored
So users who are interested in this can track it easily.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #6067 from rxin/SPARK-7550 and squashes the following commits:
      
      ee0e34c [Reynold Xin] Updated DataFrame.saveAsTable Hive warning to include SPARK-7550 ticket.
      87229c95
    • Reynold Xin's avatar
      [SPARK-7462][SQL] Update documentation for retaining grouping columns in DataFrames. · 3a9b6997
      Reynold Xin authored
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #6062 from rxin/agg-retain-doc and squashes the following commits:
      
      43e511e [Reynold Xin] [SPARK-7462][SQL] Update documentation for retaining grouping columns in DataFrames.
      3a9b6997
    • madhukar's avatar
      [SPARK-7084] improve saveAsTable documentation · 57255dcd
      madhukar authored
      Author: madhukar <phatak.dev@gmail.com>
      
      Closes #5654 from phatak-dev/master and squashes the following commits:
      
      386f407 [madhukar] #5654 updated for all the methods
      2c997c5 [madhukar] Merge branch 'master' of https://github.com/apache/spark
      00bc819 [madhukar] Merge branch 'master' of https://github.com/apache/spark
      2a802c6 [madhukar] #5654 updated the doc according to comments
      866e8df [madhukar] [SPARK-7084] improve saveAsTable documentation
      57255dcd