  1. Apr 25, 2015
  2. Apr 24, 2015
    • Deborah Siegel's avatar
      [SPARK-7136][Docs] Spark SQL and DataFrame Guide fix example file and paths · 59b7cfc4
      Deborah Siegel authored
      Changes the example file for the Generic Load/Save Functions section to users.parquet rather than people.parquet, which doesn't exist unless a later example has already been executed. Also adds file paths.
      
      Author: Deborah Siegel <deborah.siegel@gmail.com>
      Author: DEBORAH SIEGEL <deborahsiegel@d-140-142-0-49.dhcp4.washington.edu>
      Author: DEBORAH SIEGEL <deborahsiegel@DEBORAHs-MacBook-Pro.local>
      Author: DEBORAH SIEGEL <deborahsiegel@d-69-91-154-197.dhcp4.washington.edu>
      
      Closes #5693 from d3borah/master and squashes the following commits:
      
      4d5e43b [Deborah Siegel] sparkSQL doc change
      b15a497 [Deborah Siegel] Revert "sparkSQL doc change"
      5a2863c [DEBORAH SIEGEL] Merge remote-tracking branch 'upstream/master'
      91972fc [DEBORAH SIEGEL] sparkSQL doc change
      f000e59 [DEBORAH SIEGEL] Merge remote-tracking branch 'upstream/master'
      db54173 [DEBORAH SIEGEL] fixed aggregateMessages example in graphX doc
      59b7cfc4
    • linweizhong's avatar
      [PySpark][Minor] Update sql example, so that can read file correctly · d874f8b5
      linweizhong authored
      By default, Spark reads files from HDFS if we don't specify the URI scheme in the path.
      
      Author: linweizhong <linweizhong@huawei.com>
      
      Closes #5684 from Sephiroth-Lin/pyspark_example_minor and squashes the following commits:
      
      19fe145 [linweizhong] Update example sql.py, so that can read file correctly
      d874f8b5
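As a rough illustration of why the example needed fixing: a path with no URI scheme is resolved against the cluster's default filesystem (typically HDFS), so local files must be addressed with an explicit `file://` scheme. A minimal sketch, with a made-up `resolve_path` helper and default-filesystem URL:

```python
from urllib.parse import urlparse

def resolve_path(path, default_fs="hdfs://namenode:8020"):
    """Prefix `path` with the default filesystem if it carries no scheme."""
    if urlparse(path).scheme:
        return path  # explicit scheme wins, e.g. file:// or s3a://
    return default_fs + "/" + path.lstrip("/")

print(resolve_path("data/people.txt"))         # resolved against HDFS
print(resolve_path("file:///tmp/people.txt"))  # left untouched
```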
    • Calvin Jia's avatar
      [SPARK-6122] [CORE] Upgrade tachyon-client version to 0.6.3 · 438859eb
      Calvin Jia authored
      This is a reopening of #4867.
      A short summary of the issues resolved from the previous PR:
      
      1. HTTPClient version mismatch: Selenium (used for UI tests) requires version 4.3.x, and Tachyon included 4.2.5 through a transitive dependency of its shaded thrift jar. To address this, Tachyon 0.6.3 will promote the transitive dependencies of the shaded jar so they can be excluded in Spark.

      2. Jackson-Mapper-ASL version mismatch: In lower versions of hadoop-client (i.e. 1.0.4), version 1.0.1 is included. The parquet library used in Spark SQL requires version 1.8+. It's unclear to me why upgrading tachyon-client would cause this dependency to break. The solution was to exclude jackson-mapper-asl from hadoop-client.

      It seems that the dependency management in spark-parent will not work on transitive dependencies; one way to make sure jackson-mapper-asl is included with the correct version is to add it as a top-level dependency. The best solution would be to exclude the dependency in the modules which require a higher version, but that did not fix the unit tests. Any suggestions on the best way to solve this would be appreciated!
      
      Author: Calvin Jia <jia.calvin@gmail.com>
      
      Closes #5354 from calvinjia/upgrade_tachyon_0.6.3 and squashes the following commits:
      
      0eefe4d [Calvin Jia] Handle httpclient version in maven dependency management. Remove httpclient version setting from profiles.
      7c00dfa [Calvin Jia] Set httpclient version to 4.3.2 for selenium. Specify version of httpclient for sql/hive (previously 4.2.5 transitive dependency of libthrift).
      9263097 [Calvin Jia] Merge master to test latest changes
      dbfc1bd [Calvin Jia] Use Tachyon 0.6.4 for cleaner dependencies.
      e2ff80a [Calvin Jia] Exclude the jetty and curator promoted dependencies from tachyon-client.
      a3a29da [Calvin Jia] Update tachyon-client exclusions.
      0ae6c97 [Calvin Jia] Change tachyon version to 0.6.3
      a204df9 [Calvin Jia] Update make distribution tachyon version.
      a93c94f [Calvin Jia] Exclude jackson-mapper-asl from hadoop client since it has a lower version than spark's expected version.
      a8a923c [Calvin Jia] Exclude httpcomponents from Tachyon
      910fabd [Calvin Jia] Update to master
      eed9230 [Calvin Jia] Update tachyon version to 0.6.1.
      11907b3 [Calvin Jia] Use TachyonURI for tachyon paths instead of strings.
      71bf441 [Calvin Jia] Upgrade Tachyon client version to 0.6.0.
      438859eb
    • Sun Rui's avatar
      [SPARK-6852] [SPARKR] Accept numeric as numPartitions in SparkR. · caf0136e
      Sun Rui authored
      Author: Sun Rui <rui.sun@intel.com>
      
      Closes #5613 from sun-rui/SPARK-6852 and squashes the following commits:
      
      abaf02e [Sun Rui] Change the type of default numPartitions from integer to numeric in generics.R.
      29d67c1 [Sun Rui] [SPARK-6852][SPARKR] Accept numeric as numPartitions in SparkR.
      caf0136e
    • Sun Rui's avatar
      [SPARK-7033] [SPARKR] Clean usage of split. Use partition instead where applicable. · ebb77b2a
      Sun Rui authored
      Author: Sun Rui <rui.sun@intel.com>
      
      Closes #5628 from sun-rui/SPARK-7033 and squashes the following commits:
      
      046bc9e [Sun Rui] Clean split usage in tests.
      d531c86 [Sun Rui] [SPARK-7033][SPARKR] Clean usage of split. Use partition instead where applicable.
      ebb77b2a
    • Xusen Yin's avatar
      [SPARK-6528] [ML] Add IDF transformer · 6e57d57b
      Xusen Yin authored
      See [SPARK-6528](https://issues.apache.org/jira/browse/SPARK-6528). Add IDF transformer in ML package.
      
      Author: Xusen Yin <yinxusen@gmail.com>
      
      Closes #5266 from yinxusen/SPARK-6528 and squashes the following commits:
      
      741db31 [Xusen Yin] get param from new paramMap
      d169967 [Xusen Yin] add final to param and IDF class
      c9c3759 [Xusen Yin] simplify test suite
      5867c09 [Xusen Yin] refine IDF transformer with new interfaces
      7727cae [Xusen Yin] Merge branch 'master' into SPARK-6528
      4338a37 [Xusen Yin] Merge branch 'master' into SPARK-6528
      aef2cdf [Xusen Yin] add doc and group for param
      5760b49 [Xusen Yin] fix code style
      2add691 [Xusen Yin] fix code style and test
      03fbecb [Xusen Yin] remove duplicated code
      2aa4be0 [Xusen Yin] clean test suite
      4802c67 [Xusen Yin] add IDF transformer and test suite
      6e57d57b
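For reference, Spark's IDF transformer computes idf(t) = log((m + 1) / (df(t) + 1)), where m is the number of documents and df(t) is the number of documents containing term t. A plain-Python sketch of that formula (not MLlib's actual code):

```python
import math

def idf(num_docs, doc_freqs):
    """Inverse document frequency as documented for Spark's IDF:
    log((m + 1) / (df + 1)) per term."""
    return [math.log((num_docs + 1) / (df + 1)) for df in doc_freqs]

# 4 documents; term 0 appears in all 4 (idf 0), term 1 in only one
print(idf(4, [4, 1]))
```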
    • Xiangrui Meng's avatar
      [SPARK-7115] [MLLIB] skip the very first 1 in poly expansion · 78b39c7e
      Xiangrui Meng authored
      yinxusen
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #5681 from mengxr/SPARK-7115 and squashes the following commits:
      
      9ac27cd [Xiangrui Meng] skip the very first 1 in poly expansion
      78b39c7e
    • Xusen Yin's avatar
      [SPARK-5894] [ML] Add polynomial mapper · 8509519d
      Xusen Yin authored
      See [SPARK-5894](https://issues.apache.org/jira/browse/SPARK-5894).
      
      Author: Xusen Yin <yinxusen@gmail.com>
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #5245 from yinxusen/SPARK-5894 and squashes the following commits:
      
      dc461a6 [Xusen Yin] merge polynomial expansion v2
      6d0c3cc [Xusen Yin] Merge branch 'SPARK-5894' of https://github.com/mengxr/spark into mengxr-SPARK-5894
      57bfdd5 [Xusen Yin] Merge branch 'master' into SPARK-5894
      3d02a7d [Xusen Yin] Merge branch 'master' into SPARK-5894
      a067da2 [Xiangrui Meng] a new approach for poly expansion
      0789d81 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-5894
      4e9aed0 [Xusen Yin] fix test suite
      95d8fb9 [Xusen Yin] fix sparse vector indices
      8d39674 [Xusen Yin] fix sparse vector expansion error
      5998dd6 [Xusen Yin] fix dense vector fillin
      fa3ade3 [Xusen Yin] change the functional code into imperative one to speedup
      b70e7e1 [Xusen Yin] remove useless case class
      6fa236f [Xusen Yin] fix vector slice error
      daff601 [Xusen Yin] fix index error of sparse vector
      6bd0a10 [Xusen Yin] merge repeated features
      419f8a2 [Xusen Yin] need to merge same columns
      4ebf34e [Xusen Yin] add test suite of polynomial expansion
      372227c [Xusen Yin] add polynomial expansion
      8509519d
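The expansion itself is easy to sketch: generate every monomial of the input features up to the given degree, skipping the constant term 1 (the bug fixed in SPARK-7115 above). This toy version uses combinations with replacement; the output ordering is illustrative and may differ from Spark's:

```python
from itertools import combinations_with_replacement
from math import prod

def poly_expand(features, degree=2):
    """All monomials of `features` up to `degree`, excluding the constant 1."""
    out = []
    for d in range(1, degree + 1):
        for combo in combinations_with_replacement(features, d):
            out.append(prod(combo))
    return out

# degree-2 expansion of (x, y) = (2, 3): x, y, x^2, xy, y^2
print(poly_expand([2.0, 3.0]))
```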
    • Reynold Xin's avatar
      Fixed a typo from the previous commit. · 4c722d77
      Reynold Xin authored
      4c722d77
  3. Apr 23, 2015
    • Reynold Xin's avatar
      [SQL] Fixed expression data type matching. · d3a302de
      Reynold Xin authored
      Also took the chance to improve documentation for various types.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #5675 from rxin/data-type-matching-expr and squashes the following commits:
      
      0f31856 [Reynold Xin] One more function documentation.
      27c1973 [Reynold Xin] Added more documentation.
      336a36d [Reynold Xin] [SQL] Fixed expression data type matching.
      d3a302de
    • Ken Geis's avatar
      Update sql-programming-guide.md · 67bccbda
      Ken Geis authored
      fix typo
      
      Author: Ken Geis <geis.ken@gmail.com>
      
      Closes #5674 from kgeis/patch-1 and squashes the following commits:
      
      5ae67de [Ken Geis] Update sql-programming-guide.md
      67bccbda
    • Yin Huai's avatar
      [SPARK-7060][SQL] Add alias function to python dataframe · 2d010f7a
      Yin Huai authored
      This pr tries to provide a way to let python users workaround https://issues.apache.org/jira/browse/SPARK-6231.
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #5634 from yhuai/pythonDFAlias and squashes the following commits:
      
      8465acd [Yin Huai] Add an alias to a Python DF.
      2d010f7a
    • Cheolsoo Park's avatar
      [SPARK-7037] [CORE] Inconsistent behavior for non-spark config properties in... · 336f7f53
      Cheolsoo Park authored
      [SPARK-7037] [CORE] Inconsistent behavior for non-spark config properties in spark-shell and spark-submit
      
      When specifying non-spark properties (i.e. names that don't start with spark.) on the command line or in the config file, spark-submit and spark-shell behave differently, causing confusion for users.
      Here is the summary:
      * spark-submit
        * --conf k=v => silently ignored
        * spark-defaults.conf => applied
      * spark-shell
        * --conf k=v => warning message shown, then ignored
        *  spark-defaults.conf => warning message shown, then ignored

      I assume that ignoring non-spark properties is intentional. If so, they should be ignored with a warning message in all cases.
      
      Author: Cheolsoo Park <cheolsoop@netflix.com>
      
      Closes #5617 from piaozhexiu/SPARK-7037 and squashes the following commits:
      
      8957950 [Cheolsoo Park] Add IgnoreNonSparkProperties method
      fedd01c [Cheolsoo Park] Ignore non-spark properties with a warning message in all cases
      336f7f53
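The uniform behavior proposed here can be sketched as a simple filter that keeps `spark.`-prefixed keys and warns about everything else (names and structure below are illustrative, not Spark's actual code):

```python
import warnings

def load_spark_properties(props):
    """Keep only keys starting with 'spark.'; warn about and drop the rest."""
    kept = {}
    for k, v in props.items():
        if k.startswith("spark."):
            kept[k] = v
        else:
            warnings.warn(f"Ignoring non-spark config property: {k}={v}")
    return kept

conf = load_spark_properties({"spark.master": "local", "java.opts": "-Xmx1g"})
print(conf)
```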
    • Sun Rui's avatar
      [SPARK-6818] [SPARKR] Support column deletion in SparkR DataFrame API. · 73db132b
      Sun Rui authored
      Author: Sun Rui <rui.sun@intel.com>
      
      Closes #5655 from sun-rui/SPARK-6818 and squashes the following commits:
      
      7c66570 [Sun Rui] [SPARK-6818][SPARKR] Support column deletion in SparkR DataFrame API.
      73db132b
    • Reynold Xin's avatar
      [SQL] Break dataTypes.scala into multiple files. · 6220d933
      Reynold Xin authored
      It was over 1000 lines of code, making it harder to find all the types. This change only moves code around and doesn't modify any of it.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #5670 from rxin/break-types and squashes the following commits:
      
      8c59023 [Reynold Xin] Check in missing files.
      dcd5193 [Reynold Xin] [SQL] Break dataTypes.scala into multiple files.
      6220d933
    • Xiangrui Meng's avatar
      [SPARK-7070] [MLLIB] LDA.setBeta should call setTopicConcentration. · 1ed46a60
      Xiangrui Meng authored
      jkbradley
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #5649 from mengxr/SPARK-7070 and squashes the following commits:
      
      c66023c [Xiangrui Meng] setBeta should call setTopicConcentration
      1ed46a60
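The fix follows a common pattern: a synonym setter should delegate to the canonical one so that validation lives in a single place. A hypothetical sketch (not MLlib's actual code; the parameter names mirror the commit):

```python
class LDA:
    """Sketch: setBeta is a synonym for setTopicConcentration, so it
    delegates rather than writing the field directly."""

    def __init__(self):
        self._topic_concentration = -1.0  # -1 means "use the default"

    def set_topic_concentration(self, value):
        if value != -1.0 and value <= 0:
            raise ValueError("topicConcentration must be > 0 or -1 (default)")
        self._topic_concentration = value
        return self

    def set_beta(self, value):
        # delegate so both setters share the same validation
        return self.set_topic_concentration(value)

print(LDA().set_beta(0.5)._topic_concentration)
```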
    • Tijo Thomas's avatar
      [SPARK-7087] [BUILD] Fix path issue change version script · 6d0749ca
      Tijo Thomas authored
      Author: Tijo Thomas <tijoparacka@gmail.com>
      
      Closes #5656 from tijoparacka/FIX_PATHISSUE_CHANGE_VERSION_SCRIPT and squashes the following commits:
      
      ab4f4b1 [Tijo Thomas] removed whitespace
      24478c9 [Tijo Thomas] modified to provide the spark base dir while searching for pom and also while changing the version no
      7b8e10b [Tijo Thomas] Modified for providing the base directories while finding the list of pom files and also while changing the version no
      6d0749ca
    • WangTaoTheTonic's avatar
      [SPARK-6879] [HISTORYSERVER] check if app is completed before cleaning it up · baa83a9a
      WangTaoTheTonic authored
      https://issues.apache.org/jira/browse/SPARK-6879
      
      Use `applications` to replace `FileStatus`, and check if the app is completed before cleaning it up.
      If an exception is thrown, add the app back to `applications` to wait for the next loop.
      
      Author: WangTaoTheTonic <wangtao111@huawei.com>
      
      Closes #5491 from WangTaoTheTonic/SPARK-6879 and squashes the following commits:
      
      4a533eb [WangTaoTheTonic] treat ACE specially
      cb45105 [WangTaoTheTonic] rebase
      d4d5251 [WangTaoTheTonic] per Marcelo's comments
      d7455d8 [WangTaoTheTonic] slightly change when delete file
      b0abca5 [WangTaoTheTonic] use global var to store apps to clean
      94adfe1 [WangTaoTheTonic] leave expired apps alone to be deleted
      9872a9d [WangTaoTheTonic] use the right path
      fdef4d6 [WangTaoTheTonic] check if app is completed before cleaning it up
      baa83a9a
    • wizz's avatar
      [SPARK-7085][MLlib] Fix miniBatchFraction parameter in train method called with 4 arguments · 3e91cc27
      wizz authored
      Author: wizz <wizz@wizz-dev01.kawasaki.flab.fujitsu.com>
      
      Closes #5658 from kuromatsu-nobuyuki/SPARK-7085 and squashes the following commits:
      
      6ec2d21 [wizz] Fix miniBatchFraction parameter in train method called with 4 arguments
      3e91cc27
    • Josh Rosen's avatar
      [SPARK-7058] Include RDD deserialization time in "task deserialization time" metric · 6afde2c7
      Josh Rosen authored
      The web UI's "task deserialization time" metric is slightly misleading because it does not capture the time taken to deserialize the broadcasted RDD.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #5635 from JoshRosen/SPARK-7058 and squashes the following commits:
      
      ed90f75 [Josh Rosen] Update UI tooltip
      a3743b4 [Josh Rosen] Update comments.
      4f52910 [Josh Rosen] Roll back whitespace change
      e9cf9f4 [Josh Rosen] Remove unused variable
      9f32e55 [Josh Rosen] Expose executorDeserializeTime on Task instead of pushing runtime calculation into Task.
      21f5b47 [Josh Rosen] Don't double-count the broadcast deserialization time in task runtime
      1752f0e [Josh Rosen] [SPARK-7058] Incorporate RDD deserialization time in task deserialization time metric
      6afde2c7
    • Vinod K C's avatar
      [SPARK-7055][SQL]Use correct ClassLoader for JDBC Driver in JDBCRDD.getConnector · c1213e6a
      Vinod K C authored
      Author: Vinod K C <vinod.kc@huawei.com>
      
      Closes #5633 from vinodkc/use_correct_classloader_driverload and squashes the following commits:
      
      73c5380 [Vinod K C] Use correct ClassLoader for JDBC Driver
      c1213e6a
    • Tathagata Das's avatar
      [SPARK-6752][Streaming] Allow StreamingContext to be recreated from checkpoint... · 534f2a43
      Tathagata Das authored
      [SPARK-6752][Streaming] Allow StreamingContext to be recreated from checkpoint and existing SparkContext
      
      Currently, if you want to create a StreamingContext from checkpoint information, the system will create a new SparkContext. This prevents a StreamingContext from being recreated from checkpoints in managed environments where the SparkContext is pre-created.
      
      The solution in this PR: Introduce the following methods on StreamingContext
      1. `new StreamingContext(checkpointDirectory, sparkContext)`
         Recreate StreamingContext from checkpoint using the provided SparkContext
      2. `StreamingContext.getOrCreate(checkpointDirectory, sparkContext, createFunction: SparkContext => StreamingContext)`
         If checkpoint file exists, then recreate StreamingContext using the provided SparkContext (that is, call 1.), else create StreamingContext using the provided createFunction
      
      TODO: the corresponding Java and Python APIs have to be added as well.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #5428 from tdas/SPARK-6752 and squashes the following commits:
      
      94db63c [Tathagata Das] Fix long line.
      524f519 [Tathagata Das] Many changes based on PR comments.
      eabd092 [Tathagata Das] Added Function0, Java API and unit tests for StreamingContext.getOrCreate
      36a7823 [Tathagata Das] Minor changes.
      204814e [Tathagata Das] Added StreamingContext.getOrCreate with existing SparkContext
      534f2a43
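The decision logic of the proposed `getOrCreate` can be sketched like this (a hypothetical Python mock-up; the real implementation is Scala and restores the context from serialized checkpoint data):

```python
import os

class StreamingContext:
    """Stand-in for the real class; `source` records how it was built."""
    def __init__(self, source, sc):
        self.source, self.sc = source, sc

def get_or_create(checkpoint_dir, sc, create_fn):
    """If checkpoint data exists, recreate the context from it using the
    provided SparkContext; otherwise call create_fn(sc)."""
    if os.path.isdir(checkpoint_dir):
        return StreamingContext("checkpoint:" + checkpoint_dir, sc)
    return create_fn(sc)

ssc = get_or_create("/no/such/checkpoint", "sc",
                    lambda sc: StreamingContext("fresh", sc))
print(ssc.source)
```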
    • Cheng Hao's avatar
      [SPARK-7044] [SQL] Fix the deadlock in script transformation · cc48e638
      Cheng Hao authored
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #5625 from chenghao-intel/transform and squashes the following commits:
      
      5ec1dd2 [Cheng Hao] fix the deadlock issue in ScriptTransform
      cc48e638
    • Prabeesh K's avatar
      [minor][streaming]fixed scala string interpolation error · 975f53e4
      Prabeesh K authored
      Author: Prabeesh K <prabeesh.k@namshi.com>
      
      Closes #5653 from prabeesh/fix and squashes the following commits:
      
      9d7a9f5 [Prabeesh K] fixed scala string interpolation error
      975f53e4
    • Prashant Sharma's avatar
      [HOTFIX] [SQL] Fix compilation for scala 2.11. · a7d65d38
      Prashant Sharma authored
      Author: Prashant Sharma <prashant.s@imaginea.com>
      
      Closes #5652 from ScrapCodes/hf/compilation-fix-scala-2.11 and squashes the following commits:
      
      819ff06 [Prashant Sharma] [HOTFIX] Fix compilation for scala 2.11.
      a7d65d38
    • Reynold Xin's avatar
      [SPARK-7069][SQL] Rename NativeType -> AtomicType. · f60bece1
      Reynold Xin authored
      Also renamed JvmType to InternalType.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #5651 from rxin/native-to-atomic-type and squashes the following commits:
      
      cbd4028 [Reynold Xin] [SPARK-7069][SQL] Rename NativeType -> AtomicType.
      f60bece1
    • Reynold Xin's avatar
      [SPARK-7068][SQL] Remove PrimitiveType · 29163c52
      Reynold Xin authored
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #5646 from rxin/remove-primitive-type and squashes the following commits:
      
      01b673d [Reynold Xin] [SPARK-7068][SQL] Remove PrimitiveType
      29163c52
    • Reynold Xin's avatar
      [MLlib] Add support for BooleanType to VectorAssembler. · 2d33323c
      Reynold Xin authored
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #5648 from rxin/vectorAssembler-boolean and squashes the following commits:
      
      1bf3d40 [Reynold Xin] [MLlib] Add support for BooleanType to VectorAssembler.
      2d33323c
    • Liang-Chi Hsieh's avatar
      [HOTFIX][SQL] Fix broken cached test · d9e70f33
      Liang-Chi Hsieh authored
      Added in #5475. Pointed as broken in #5639.
      /cc marmbrus
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #5640 from viirya/fix_cached_test and squashes the following commits:
      
      c0cf69a [Liang-Chi Hsieh] Fix broken cached test.
      d9e70f33
  4. Apr 22, 2015
    • Kay Ousterhout's avatar
      [SPARK-7046] Remove InputMetrics from BlockResult · 03e85b4a
      Kay Ousterhout authored
      This is a code cleanup.
      
      The BlockResult class originally contained an InputMetrics object so that InputMetrics could
      directly be used as the InputMetrics for the whole task. Now we copy the fields out of here, and
      the presence of this object is confusing because it's only a partial input metrics (it doesn't
      include the records read). Because this object is no longer useful (and is confusing), it should
      be removed.
      
      Author: Kay Ousterhout <kayousterhout@gmail.com>
      
      Closes #5627 from kayousterhout/SPARK-7046 and squashes the following commits:
      
      bf64bbe [Kay Ousterhout] Import fix
      a08ca19 [Kay Ousterhout] [SPARK-7046] Remove InputMetrics from BlockResult
      03e85b4a
    • Reynold Xin's avatar
      [SPARK-7066][MLlib] VectorAssembler should use NumericType not NativeType. · d2068606
      Reynold Xin authored
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #5642 from rxin/mllib-native-type and squashes the following commits:
      
      e23af5b [Reynold Xin] Remove StringType
      7cbb205 [Reynold Xin] [SPARK-7066][MLlib] VectorAssembler should use NumericType and StringType, not NativeType.
      d2068606
    • Reynold Xin's avatar
      [MLlib] UnaryTransformer nullability should not depend on PrimitiveType. · 1b85e085
      Reynold Xin authored
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #5644 from rxin/mllib-nullable and squashes the following commits:
      
      a727e5b [Reynold Xin] [MLlib] UnaryTransformer nullability should not depend on primitive types.
      1b85e085
    • Daoyuan Wang's avatar
      [SPARK-6967] [SQL] fix date type conversion in jdbcrdd · 04525c07
      Daoyuan Wang authored
      This PR converts the java.sql.Date type into Int for JDBCRDD.
      
      Author: Daoyuan Wang <daoyuan.wang@intel.com>
      
      Closes #5590 from adrian-wang/datebug and squashes the following commits:
      
      f897b81 [Daoyuan Wang] add a test case
      3c9184c [Daoyuan Wang] fix date type conversion in jdbcrdd
      04525c07
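Context for the fix: Spark SQL internally represents DateType as the number of days since the Unix epoch, so a java.sql.Date has to be converted to an Int. The same conversion in plain Python:

```python
from datetime import date

EPOCH = date(1970, 1, 1)

def date_to_days(d):
    """Days since the Unix epoch, the internal Int encoding of DateType."""
    return (d - EPOCH).days

print(date_to_days(date(1970, 1, 2)))  # 1
```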
    • Yanbo Liang's avatar
      [SPARK-6827] [MLLIB] Wrap FPGrowthModel.freqItemsets and make it consistent with Java API · f4f39981
      Yanbo Liang authored
      Make PySpark ```FPGrowthModel.freqItemsets``` consistent with the Java/Scala API, like ```MatrixFactorizationModel.userFeatures```.
      It returns an RDD in which each tuple is composed of an array and a long value.
      I think it's difficult to implement namedtuples to wrap the output because the items of freqItemsets can be of any type with arbitrary length, which makes it tedious to implement a corresponding SerDe function.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #5614 from yanboliang/spark-6827 and squashes the following commits:
      
      da8c404 [Yanbo Liang] use namedtuple
      5532e78 [Yanbo Liang] Wrap FPGrowthModel.freqItemsets and make it consistent with Java API
      f4f39981
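The shape of the result described above — each frequent itemset as an (items array, frequency count) pair, matching the Scala `FreqItemset(items, freq)` — can be mirrored with a namedtuple, which is the approach the final commit took:

```python
from collections import namedtuple

# Field names mirror the Scala FPGrowth.FreqItemset class.
FreqItemset = namedtuple("FreqItemset", ["items", "freq"])

itemsets = [FreqItemset(items=["a"], freq=4),
            FreqItemset(items=["a", "b"], freq=2)]
for fi in itemsets:
    print(fi.items, fi.freq)
```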
    • Reynold Xin's avatar
      [SPARK-7059][SQL] Create a DataFrame join API to facilitate equijoin. · baf865dd
      Reynold Xin authored
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #5638 from rxin/joinUsing and squashes the following commits:
      
      13e9cc9 [Reynold Xin] Code review + Python.
      b1bd914 [Reynold Xin] [SPARK-7059][SQL] Create a DataFrame join API to facilitate equijoin and self join.
      baf865dd
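The equijoin convenience this API adds — joining on a shared column name and emitting the join key only once in the output — can be illustrated with a toy in-memory join (plain Python over dicts, not the DataFrame API):

```python
def equijoin(left, right, key):
    """Nested-loop equijoin on a shared column name; the key column
    appears once per output row, as with df.join(other, "key")."""
    out = []
    for l in left:
        for r in right:
            if l[key] == r[key]:
                row = {key: l[key]}
                row.update({k: v for k, v in l.items() if k != key})
                row.update({k: v for k, v in r.items() if k != key})
                out.append(row)
    return out

people = [{"id": 1, "name": "ann"}, {"id": 2, "name": "bob"}]
ages = [{"id": 1, "age": 30}]
print(equijoin(people, ages, "id"))
```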