  1. May 08, 2015
    • Burak Yavuz's avatar
      [SPARK-7383] [ML] Feature Parity in PySpark for ml.features · f5ff4a84
      Burak Yavuz authored
       Implemented Python wrappers for the Scala functionality that doesn't yet exist in PySpark's `ml.features`.
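       A hedged usage sketch of one of the newly wrapped transformers (RegexTokenizer appears in the squashed commits below); this assumes the pyspark shell, where `sqlContext` is predefined:
       ```
       from pyspark.ml.feature import RegexTokenizer

       df = sqlContext.createDataFrame([("a b,c",)], ["text"])
       tokenizer = RegexTokenizer(inputCol="text", outputCol="words", pattern="\\W+")
       tokenizer.transform(df).select("words").show()
       ```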
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #5991 from brkyvz/ml-feat-PR and squashes the following commits:
      
      adcca55 [Burak Yavuz] add regex tokenizer to __all__
      b91cb44 [Burak Yavuz] addressed comments
      bd39fd2 [Burak Yavuz] remove addition
      b82bd7c [Burak Yavuz] Parity in PySpark for ml.features
      f5ff4a84
    • Imran Rashid's avatar
      [SPARK-3454] separate json endpoints for data in the UI · c796be70
      Imran Rashid authored
      Exposes data available in the UI as json over http.  Key points:
      
      * new endpoints, handled independently of existing XyzPage classes.  Root entrypoint is `JsonRootResource`
      * Uses jersey + jackson for routing & converting POJOs into json
      * tests against known results in `HistoryServerSuite`
      * also fixes some minor issues w/ the UI -- synchronizing on access to `StorageListener` & `StorageStatusListener`, and fixing some inconsistencies w/ the way we handle retained jobs & stages.
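       As a rough, hedged illustration of consuming these endpoints (the `/api/v1` prefix reflects where the API eventually settled; the exact path served by `JsonRootResource` at this commit may differ):
       ```
       import json
       import urllib2  # Python 2, contemporary with this change

       # List the applications known to a running UI (port 4040) or history server (port 18080).
       url = "http://localhost:4040/api/v1/applications"
       apps = json.load(urllib2.urlopen(url))
       for app in apps:
           print("%s  %s" % (app["id"], app["name"]))
       ```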
      
      Author: Imran Rashid <irashid@cloudera.com>
      
      Closes #5940 from squito/SPARK-3454_better_test_files and squashes the following commits:
      
      1a72ed6 [Imran Rashid] rats
      85fdb3e [Imran Rashid] Merge branch 'no_php' into SPARK-3454
      1fc65b0 [Imran Rashid] Revert "Revert "[SPARK-3454] separate json endpoints for data in the UI""
      1276900 [Imran Rashid] get rid of giant event file, replace w/ smaller one; check both shuffle read & shuffle write
      4e12013 [Imran Rashid] just use test case name for expectation file name
      863ef64 [Imran Rashid] rename json files to avoid strange file names and not look like php
      c796be70
    • Lianhui Wang's avatar
      [SPARK-6869] [PYSPARK] Add pyspark archives path to PYTHONPATH · ebff7327
      Lianhui Wang authored
       Based on https://github.com/apache/spark/pull/5478, which provides a PYSPARK_ARCHIVES_PATH env variable. With this PR, we just need to export PYSPARK_ARCHIVES_PATH=/user/spark/pyspark.zip,/user/spark/python/lib/py4j-0.8.2.1-src.zip in conf/spark-env.sh when PySpark is not installed on each node of the YARN cluster. I ran a Python application successfully on yarn-client and yarn-cluster with this PR.
       andrewor14 sryza Sephiroth-Lin Can you take a look at this? Thanks.
      
      Author: Lianhui Wang <lianhuiwang09@gmail.com>
      
      Closes #5580 from lianhuiwang/SPARK-6869 and squashes the following commits:
      
      66ffa43 [Lianhui Wang] Update Client.scala
      c2ad0f9 [Lianhui Wang] Update Client.scala
      1c8f664 [Lianhui Wang] Merge remote-tracking branch 'remotes/apache/master' into SPARK-6869
      008850a [Lianhui Wang] Merge remote-tracking branch 'remotes/apache/master' into SPARK-6869
      f0b4ed8 [Lianhui Wang] Merge remote-tracking branch 'remotes/apache/master' into SPARK-6869
      150907b [Lianhui Wang] Merge remote-tracking branch 'remotes/apache/master' into SPARK-6869
      20402cd [Lianhui Wang] use ZipEntry
      9d87c3f [Lianhui Wang] update scala style
      e7bd971 [Lianhui Wang] address vanzin's comments
      4b8a3ed [Lianhui Wang] use pyArchivesEnvOpt
      e6b573b [Lianhui Wang] address vanzin's comments
      f11f84a [Lianhui Wang] zip pyspark archives
      5192cca [Lianhui Wang] update import path
      3b1e4c8 [Lianhui Wang] address tgravescs's comments
      9396346 [Lianhui Wang] put zip to make-distribution.sh
      0d2baf7 [Lianhui Wang] update import paths
      e0179be [Lianhui Wang] add zip pyspark archives in build or sparksubmit
      31e8e06 [Lianhui Wang] update code style
      9f31dac [Lianhui Wang] update code and add comments
      f72987c [Lianhui Wang] add archives path to PYTHONPATH
      ebff7327
    • Zhang, Liye's avatar
      [SPARK-7392] [CORE] bugfix: Kryo buffer size cannot be larger than 2M · c2f0821a
      Zhang, Liye authored
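       A hedged sketch of what the fix enables, assuming the contemporary configuration keys `spark.kryoserializer.buffer` / `spark.kryoserializer.buffer.max`:
       ```
       from pyspark import SparkConf, SparkContext

       # With this fix, a Kryo buffer limit larger than 2 MB should be accepted.
       conf = (SparkConf()
               .setAppName("kryo-example")
               .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
               .set("spark.kryoserializer.buffer.max", "64m"))
       sc = SparkContext(conf=conf)
       ```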
      Author: Zhang, Liye <liye.zhang@intel.com>
      
      Closes #5934 from liyezhang556520/kryoBufSize and squashes the following commits:
      
      5707e04 [Zhang, Liye] fix import order
      8693288 [Zhang, Liye] replace multiplier with ByteUnit methods
      9bf93e9 [Zhang, Liye] add tests
      d91e5ed [Zhang, Liye] change kb to mb
      c2f0821a
    • wangfei's avatar
      [SPARK-7232] [SQL] Add a Substitution batch for spark sql analyzer · f496bf3c
      wangfei authored
         Added a new batch named `Substitution` before the `Resolution` batch. The motivation is that there are cases where we want to perform some substitution on the parsed logical plan before resolving it.
       Consider these two cases:
       1. CTE: for a CTE we first build a raw logical plan
      ```
      'With Map(q1 -> 'Subquery q1
                         'Project ['key]
                            'UnresolvedRelation [src], None)
       'Project [*]
        'Filter ('key = 5)
         'UnresolvedRelation [q1], None
      ```
       In the `With` logical plan there is a map storing (`q1 -> subquery`); we first want to take the `With` command off and substitute the `q1` of `UnresolvedRelation` with the `subquery` (a PySpark sketch of such a query follows case 2 below).
      
       2. Another example is window functions: a user may define some windows, and we also need to substitute the window name in the child by the concrete window definition. This should also be done in the Substitution batch.
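       A hedged PySpark sketch of the CTE query that case 1 describes (pyspark shell assumed, with a registered table `src` that has a column `key`):
       ```
       sqlContext.sql("""
           WITH q1 AS (SELECT key FROM src)
           SELECT * FROM q1 WHERE key = 5
       """).show()
       ```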
      
      Author: wangfei <wangfei1@huawei.com>
      
      Closes #5776 from scwf/addbatch and squashes the following commits:
      
      d4b962f [wangfei] added WindowsSubstitution
      70f6932 [wangfei] Merge branch 'master' of https://github.com/apache/spark into addbatch
      ecaeafb [wangfei] address yhuai's comments
      553005a [wangfei] fix test case
      0c54798 [wangfei] address comments
      29aaaaf [wangfei] fix compile
      1c9a092 [wangfei] added Substitution bastch
      f496bf3c
    • Andrew Or's avatar
      [SPARK-7470] [SQL] Spark shell SQLContext crashes without hive · 714db2ef
      Andrew Or authored
      This only happens if you have `SPARK_PREPEND_CLASSES` set. Then I built it with `build/sbt clean assembly compile` and just ran it with `bin/spark-shell`.
      ```
      ...
      15/05/07 17:07:30 INFO EventLoggingListener: Logging events to file:/tmp/spark-events/local-1431043649919
      15/05/07 17:07:30 INFO SparkILoop: Created spark context..
      Spark context available as sc.
      java.lang.NoClassDefFoundError: org/apache/hadoop/hive/conf/HiveConf
      	at java.lang.Class.getDeclaredConstructors0(Native Method)
      	at java.lang.Class.privateGetDeclaredConstructors(Class.java:2493)
      	at java.lang.Class.getConstructor0(Class.java:2803)
      ...
      Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hive.conf.HiveConf
      	at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
      	at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
      	at java.security.AccessController.doPrivileged(Native Method)
      	at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
      	at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
      	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
      	at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
      	... 52 more
      
      <console>:10: error: not found: value sqlContext
             import sqlContext.implicits._
                    ^
      <console>:10: error: not found: value sqlContext
             import sqlContext.sql
                    ^
      ```
      yhuai marmbrus
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #5997 from andrewor14/sql-shell-crash and squashes the following commits:
      
      61147e6 [Andrew Or] Also expect NoClassDefFoundError
      714db2ef
  2. May 07, 2015
    • Yin Huai's avatar
      [SPARK-6986] [SQL] Use Serializer2 in more cases. · 3af423c9
      Yin Huai authored
       With https://github.com/apache/spark/commit/0a2b15ce43cf6096e1a7ae060b7c8a4010ce3b92, the serialization stream and deserialization stream have enough information to determine whether they are handling a key-value pair, a key, or a value. It is safe to use `SparkSqlSerializer2` in more cases.
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #5849 from yhuai/serializer2MoreCases and squashes the following commits:
      
      53a5eaa [Yin Huai] Josh's comments.
      487f540 [Yin Huai] Use BufferedOutputStream.
      8385f95 [Yin Huai] Always create a new row at the deserialization side to work with sort merge join.
      c7e2129 [Yin Huai] Update tests.
      4513d13 [Yin Huai] Use Serializer2 in more places.
      3af423c9
    • Shuo Xiang's avatar
      [SPARK-7452] [MLLIB] fix bug in topBykey and update test · 92f8f803
      Shuo Xiang authored
       The toArray function of BoundedPriorityQueue does not necessarily preserve order. Add a counter-example as the test, which would fail against the original implementation.
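       A plain-Python stand-in (not the Spark code itself) illustrating why returning a heap's backing array directly is wrong:
       ```
       import heapq
       import random

       values = random.sample(range(100), 10)
       heap = []
       for v in values:
           heapq.heappush(heap, v)

       print(heap)          # heap order only -- generally NOT sorted
       print(sorted(heap))  # the ordering callers of topByKey actually expect
       ```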
      
      Author: Shuo Xiang <shuoxiangpub@gmail.com>
      
      Closes #5990 from coderxiang/topbykey-test and squashes the following commits:
      
      98804c9 [Shuo Xiang] fix bug in topBykey and update test
      92f8f803
    • Michael Armbrust's avatar
      [SPARK-6908] [SQL] Use isolated Hive client · cd1d4110
      Michael Armbrust authored
       This PR switches Spark SQL's Hive support to use the isolated Hive client interface introduced by #5851, instead of directly interacting with the client.  By using this isolated client we can now allow users to dynamically configure the version of Hive that they are connecting to by setting `spark.sql.hive.metastore.version`, without the need to recompile.  This also greatly reduces the surface area of our interaction with the Hive libraries, hopefully making it easier to support other versions in the future.
      
      Jars for the desired hive version can be configured using `spark.sql.hive.metastore.jars`, which accepts the following options:
       - a colon-separated list of jar files or directories for hive and hadoop.
       - `builtin` - attempt to discover the jars that were used to load Spark SQL and use those. This
                  option is only valid when using the execution version of Hive.
       - `maven` - download the correct version of hive on demand from maven.
      
      By default, `builtin` is used for Hive 13.
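       A hedged configuration sketch (PySpark; the metastore version string is chosen for illustration only):
       ```
       from pyspark import SparkConf, SparkContext
       from pyspark.sql import HiveContext

       conf = (SparkConf()
               .setAppName("metastore-example")
               .set("spark.sql.hive.metastore.version", "0.12.0")  # hypothetical target version
               .set("spark.sql.hive.metastore.jars", "maven"))     # or "builtin", or a classpath
       sc = SparkContext(conf=conf)
       sqlContext = HiveContext(sc)
       ```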
      
      This PR also removes the test step for building against Hive 12, as this will no longer be required to talk to Hive 12 metastores.  However, the full removal of the Shim is deferred until a later PR.
      
      Remaining TODOs:
       - Remove the Hive Shims and inline code for Hive 13.
       - Several HiveCompatibility tests are not yet passing.
         - `nullformatCTAS` - As detailed below, we are now handling CTAS parsing ourselves instead of hacking into the Hive semantic analyzer.  However, we currently only handle the common cases and not things like CTAS where the null format is specified.
         - `combine1` now leaks state about compression somehow, breaking all subsequent tests.  As such we currently add it to the blacklist.
        - `part_inherit_tbl_props` and `part_inherit_tbl_props_with_star` do not work anymore.  We are correctly propagating the information
        - "load_dyn_part14.*" - These tests pass when run on their own, but fail when run with all other tests.  It seems our `RESET` mechanism may not be as robust as it used to be?
      
      Other required changes:
       -  `CreateTableAsSelect` no longer carries parts of the HiveQL AST with it through the query execution pipeline.  Instead, we parse CTAS during the HiveQL conversion and construct a `HiveTable`.  The full parsing here is not yet complete as detailed above in the remaining TODOs.  Since the operator is Hive specific, it is moved to the hive package.
       - `Command` is simplified to be a trait that simply acts as a marker for a LogicalPlan that should be eagerly evaluated.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #5876 from marmbrus/useIsolatedClient and squashes the following commits:
      
      258d000 [Michael Armbrust] really really correct path handling
      e56fd4a [Michael Armbrust] getAbsolutePath
      5a259f5 [Michael Armbrust] fix typos
      81bb366 [Michael Armbrust] comments from vanzin
      5f3945e [Michael Armbrust] Merge remote-tracking branch 'origin/master' into useIsolatedClient
      4b5cd41 [Michael Armbrust] yin's comments
      f5de7de [Michael Armbrust] cleanup
      11e9c72 [Michael Armbrust] better coverage in versions suite
      7e8f010 [Michael Armbrust] better error messages and jar handling
      e7b3941 [Michael Armbrust] more permisive checking for function registration
      da91ba7 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into useIsolatedClient
      5fe5894 [Michael Armbrust] fix serialization suite
      81711c4 [Michael Armbrust] Initial support for running without maven
      1d8ae44 [Michael Armbrust] fix final tests?
      1c50813 [Michael Armbrust] more comments
      a3bee70 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into useIsolatedClient
      a6f5df1 [Michael Armbrust] style
      ab07f7e [Michael Armbrust] WIP
      4d8bf02 [Michael Armbrust] Remove hive 12 compilation
      8843a25 [Michael Armbrust] [SPARK-6908] [SQL] Use isolated Hive client
      cd1d4110
    • zsxwing's avatar
      [SPARK-7305] [STREAMING] [WEBUI] Make BatchPage show friendly information when... · 22ab70e0
      zsxwing authored
      [SPARK-7305] [STREAMING] [WEBUI] Make BatchPage show friendly information when jobs are dropped by SparkListener
      
      If jobs are dropped by SparkListener, at least we can show the job ids in BatchPage. Screenshot:
      
      ![b1](https://cloud.githubusercontent.com/assets/1000778/7434968/f19aa784-eff3-11e4-8f86-36a073873574.png)
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #5840 from zsxwing/SPARK-7305 and squashes the following commits:
      
      aca0ba6 [zsxwing] Fix the code style
      718765e [zsxwing] Make generateNormalJobRow private
      8073b03 [zsxwing] Merge branch 'master' into SPARK-7305
      83dec11 [zsxwing] Make BatchPage show friendly information when jobs are dropped by SparkListener
      22ab70e0
    • tedyu's avatar
      [SPARK-7450] Use UNSAFE.getLong() to speed up BitSetMethods#anySet() · 88063c62
      tedyu authored
      Author: tedyu <yuzhihong@gmail.com>
      
      Closes #5897 from tedyu/master and squashes the following commits:
      
      473bf9d [tedyu] Address Josh's review comments
      1719c5b [tedyu] Correct upper bound in for loop
      b51dcaf [tedyu] Add unit test in BitSetSuite for BitSet#anySet()
      83f9f87 [tedyu] Merge branch 'master' of github.com:apache/spark
      817e3f9 [tedyu] Replace constant 8 with SIZE_OF_LONG
      75a467b [tedyu] Correct offset for UNSAFE.getLong()
      855374b [tedyu] Remove second loop since bitSetWidthInBytes is WORD aligned
      093b7a4 [tedyu] Use UNSAFE.getLong() to speed up BitSetMethods#anySet()
      63ee050 [tedyu] Use UNSAFE.getLong() to speed up BitSetMethods#anySet()
      4ca0ef6 [tedyu] Use UNSAFE.getLong() to speed up BitSetMethods#anySet()
      3e9b6919 [tedyu] Use UNSAFE.getLong() to speed up BitSetMethods#anySet()
      88063c62
    • Wenchen Fan's avatar
      [SPARK-2155] [SQL] [WHEN D THEN E] [ELSE F] add CaseKeyWhen for "CASE a WHEN b THEN c * END" · 35f0173b
      Wenchen Fan authored
       Avoid translating to CaseWhen and evaluating the key expression many times.
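       A hedged sketch of the query shape this targets (pyspark shell assumed, table `src` with a column `key`); with `CaseKeyWhen` the key expression is evaluated once per row rather than once per WHEN branch:
       ```
       sqlContext.sql("""
           SELECT CASE key WHEN 1 THEN 'one'
                           WHEN 2 THEN 'two'
                           ELSE 'other'
                  END AS key_name
           FROM src
       """).show()
       ```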
      
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      Closes #5979 from cloud-fan/condition and squashes the following commits:
      
      3ce54e1 [Wenchen Fan] add CaseKeyWhen
      35f0173b
    • Iulian Dragos's avatar
      [SPARK-5281] [SQL] Registering table on RDD is giving MissingRequirementError · 937ba798
      Iulian Dragos authored
      Go through the context classloader when reflecting on user types in ScalaReflection.
      
      Replaced calls to `typeOf` with `typeTag[T].in(mirror)`. The convenience method assumes
      all types can be found in the classloader that loaded scala-reflect (the primordial
      classloader). This assumption is not valid in all contexts (sbt console, Eclipse launchers).
      
      Fixed SPARK-5281
      
      Author: Iulian Dragos <jaguarul@gmail.com>
      
      Closes #5981 from dragos/issue/mirrors-missing-requirement-error and squashes the following commits:
      
      d103e70 [Iulian Dragos] Go through the context classloader when reflecting on user types in ScalaReflection
      937ba798
    • Liang-Chi Hsieh's avatar
      [SPARK-7277] [SQL] Throw exception if the property mapred.reduce.tasks is set to -1 · ea3077f1
      Liang-Chi Hsieh authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-7277
      
       As automatically determining the number of reducers is not supported (`mapred.reduce.tasks` is set to `-1`), we should throw an exception to inform users.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #5811 from viirya/no_neg_reduce_tasks and squashes the following commits:
      
      e518f96 [Liang-Chi Hsieh] Consider other wrong setting values.
      fd9c817 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into no_neg_reduce_tasks
      4ede705 [Liang-Chi Hsieh] Throw exception instead of warning message.
      68a1c70 [Liang-Chi Hsieh] Show warning message if mapred.reduce.tasks is set to -1.
      ea3077f1
    • scwf's avatar
      [SQL] [MINOR] make star and multialias extend NamedExpression · 97d1182a
      scwf authored
       `Star` and `MultiAlias` are only used in the `analyzer` and will be substituted after analysis, so just like `Alias` they do not need to extend `Attribute`.
      
      Author: scwf <wangfei1@huawei.com>
      
      Closes #5928 from scwf/attribute and squashes the following commits:
      
      73a0560 [scwf] star and multialias do not need extend attribute
      97d1182a
    • Xiangrui Meng's avatar
      [SPARK-6948] [MLLIB] compress vectors in VectorAssembler · e43803b8
      Xiangrui Meng authored
      The compression is based on storage. brkyvz
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #5985 from mengxr/SPARK-6948 and squashes the following commits:
      
      df56a00 [Xiangrui Meng] update python tests
      6d90d45 [Xiangrui Meng] compress vectors in VectorAssembler
      e43803b8
    • Octavian Geagla's avatar
      [SPARK-5726] [MLLIB] Elementwise (Hadamard) Vector Product Transformer · 658a478d
      Octavian Geagla authored
      See https://issues.apache.org/jira/browse/SPARK-5726
      
      Author: Octavian Geagla <ogeagla@gmail.com>
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #4580 from ogeagla/spark-mllib-weighting and squashes the following commits:
      
      fac12ad [Octavian Geagla] [SPARK-5726] [MLLIB] Use new createTransformFunc.
      90f7e39 [Joseph K. Bradley] small cleanups
      4595165 [Octavian Geagla] [SPARK-5726] [MLLIB] Remove erroneous test case.
      ded3ac6 [Octavian Geagla] [SPARK-5726] [MLLIB] Pass style checks.
      37d4705 [Octavian Geagla] [SPARK-5726] [MLLIB] Incorporated feedback.
      1dffeee [Octavian Geagla] [SPARK-5726] [MLLIB] Pass style checks.
      e436896 [Octavian Geagla] [SPARK-5726] [MLLIB] Remove 'TF' from 'ElementwiseProductTF'
      cb520e6 [Octavian Geagla] [SPARK-5726] [MLLIB] Rename HadamardProduct to ElementwiseProduct
      4922722 [Octavian Geagla] [SPARK-5726] [MLLIB] Hadamard Vector Product Transformer
      658a478d
    • MechCoder's avatar
      [SPARK-7328] [MLLIB] [PYSPARK] Pyspark.mllib.linalg.Vectors: Missing items · 347a329a
      MechCoder authored
       Add:
       1. Class method squared_dist
       2. parse
       3. norm
       4. numNonzeros
       5. copy
      
      I made a few vectorizations wrt squared_dist and dot as well. I have added support for SparseMatrix serialization in a separate PR (https://github.com/apache/spark/pull/5775) and plan to complete support for Matrices in another PR.
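       A hedged sketch of the additions (names taken from the list above; the shipped API may differ slightly, e.g. `squared_distance` rather than `squared_dist`):
       ```
       from pyspark.mllib.linalg import Vectors

       v = Vectors.dense([1.0, 0.0, 3.0])
       w = Vectors.parse("[1.0, 2.0, 3.0]")       # parse a vector from its string form
       print(Vectors.squared_distance(v, w))       # squared distance between two vectors
       print(Vectors.norm(v, 2))                   # p-norm
       print(v.numNonzeros())                      # number of non-zero entries
       ```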
      
      Author: MechCoder <manojkumarsivaraj334@gmail.com>
      
      Closes #5872 from MechCoder/local_linalg_api and squashes the following commits:
      
      a8ff1e0 [MechCoder] minor
      ce3e53e [MechCoder] Add error message for parser
      1bd3c04 [MechCoder] Robust parser and removed unnecessary methods
      f779561 [MechCoder] [SPARK-7328] Pyspark.mllib.linalg.Vectors: Missing items
      347a329a
    • Andrew Or's avatar
      [SPARK-7347] DAG visualization: add tooltips to RDDs · 88717ee4
      Andrew Or authored
      This is an addition to #5729.
      
      Here's an example with ALS.
      <img src="https://issues.apache.org/jira/secure/attachment/12731039/tooltip.png" width="400px"></img>
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #5957 from andrewor14/viz-hover2 and squashes the following commits:
      
      60e3758 [Andrew Or] Add tooltips for RDDs on job page
      88717ee4
    • Andrew Or's avatar
      [SPARK-7391] DAG visualization: auto expand if linked from another viz · f1216514
      Andrew Or authored
      This is an addition to #5729.
      
      If you click into a stage from the DAG viz on the job page, you might expect to expand on the stage. However, once you get to the stage page, you actually have to expand the DAG viz there yourself.
      
      This patch makes this happen automatically. It's a small UX improvement.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #5958 from andrewor14/viz-auto-expand and squashes the following commits:
      
      03cd157 [Andrew Or] Automatically expand DAG viz if from job page
      f1216514
    • Timothy Chen's avatar
      [SPARK-7373] [MESOS] Add docker support for launching drivers in mesos cluster mode. · 4eecf550
      Timothy Chen authored
       Building on the existing Docker support for Mesos, this also enables the Mesos cluster mode scheduler to launch Spark drivers in Docker images.
       
       The executors launched by those drivers can use the same Docker image as well, by passing along the Docker settings.
      
      Author: Timothy Chen <tnachen@gmail.com>
      
      Closes #5917 from tnachen/spark_cluster_docker and squashes the following commits:
      
      1e842f5 [Timothy Chen] Add docker support for launching drivers in mesos cluster mode.
      4eecf550
    • Tijo Thomas's avatar
      [SPARK-7399] [SPARK CORE] Fixed compilation error in scala 2.11 · 0c33bf81
      Tijo Thomas authored
       Scala has a deterministic naming scheme for the generated methods that return default arguments. Here, one of the default arguments of the overloaded method has to be removed.
      
      Author: Tijo Thomas <tijoparacka@gmail.com>
      
      Closes #5966 from tijoparacka/fix_compilation_error_in_scala2.11 and squashes the following commits:
      
      c90bba8 [Tijo Thomas] Fixed compilation error in scala 2.11
      0c33bf81
    • Cheng Hao's avatar
      [SPARK-5213] [SQL] Remove the duplicated SparkSQLParser · 074d75d4
      Cheng Hao authored
      This is a follow up of #5827 to remove the additional `SparkSQLParser`
      
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #5965 from chenghao-intel/remove_sparksqlparser and squashes the following commits:
      
      509a233 [Cheng Hao] Remove the HiveQlQueryExecution
      a5f9e3b [Cheng Hao] Remove the duplicated SparkSQLParser
      074d75d4
    • ksonj's avatar
      [SPARK-7116] [SQL] [PYSPARK] Remove cache() causing memory leak · dec8f537
      ksonj authored
      This patch simply removes a `cache()` on an intermediate RDD when evaluating Python UDFs.
      
      Author: ksonj <kson@siberie.de>
      
      Closes #5973 from ksonj/udf and squashes the following commits:
      
      db5b564 [ksonj] removed TODO about cleaning up
      fe70c54 [ksonj] Remove cache() causing memory leak
      dec8f537
    • Yin Huai's avatar
      [SPARK-1442] [SQL] [FOLLOW-UP] Address minor comments in Window Function PR (#5604). · 5784c8d9
      Yin Huai authored
      Address marmbrus and scwf's comments in #5604.
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #5945 from yhuai/windowFollowup and squashes the following commits:
      
      0ef879d [Yin Huai] Add collectFirst to TreeNode.
      2373968 [Yin Huai] wip
      4a16df9 [Yin Huai] Address minor comments for [SPARK-1442].
      5784c8d9
    • Yanbo Liang's avatar
      [SPARK-6093] [MLLIB] Add RegressionMetrics in PySpark/MLlib · 1712a7c7
      Yanbo Liang authored
      https://issues.apache.org/jira/browse/SPARK-6093
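       A hedged usage sketch (pyspark shell assumed, where `sc` is predefined; property names assumed to mirror the Scala RegressionMetrics, per the "change to @property" commit below):
       ```
       from pyspark.mllib.evaluation import RegressionMetrics

       # (prediction, observation) pairs
       predictionAndObservations = sc.parallelize(
           [(2.5, 3.0), (0.0, -0.5), (2.0, 2.0), (8.0, 7.0)])
       metrics = RegressionMetrics(predictionAndObservations)
       print(metrics.rootMeanSquaredError)
       print(metrics.r2)
       ```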
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #5941 from yanboliang/spark-6093 and squashes the following commits:
      
      6934af3 [Yanbo Liang] change to @property
      aac3bc5 [Yanbo Liang] Add RegressionMetrics in PySpark/MLlib
      1712a7c7
    • Olivier Girardot's avatar
      [SPARK-7118] [Python] Add the coalesce Spark SQL function available in PySpark · 068c3158
      Olivier Girardot authored
       This patch adds a proxy call from PySpark to the Spark SQL `coalesce` function; it comes out of a discussion on the Spark dev list with rxin.
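       A hedged usage sketch (pyspark shell assumed):
       ```
       from pyspark.sql.functions import coalesce, lit
       from pyspark.sql.types import IntegerType, StructField, StructType

       schema = StructType([StructField("a", IntegerType(), True),
                            StructField("b", IntegerType(), True)])
       df = sqlContext.createDataFrame([(None, 2), (1, None)], schema)

       # Take the first non-null value per row, falling back to 0.
       df.select(coalesce(df["a"], df["b"], lit(0)).alias("first_non_null")).show()
       ```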
      
       This contribution is my original work and I license the work to the project under the project's open source license.
      
      Olivier.
      
      Author: Olivier Girardot <o.girardot@lateral-thoughts.com>
      
      Closes #5698 from ogirardot/master and squashes the following commits:
      
      d9a4439 [Olivier Girardot] SPARK-7118 Add the coalesce Spark SQL function available in PySpark
      068c3158
    • Burak Yavuz's avatar
      [SPARK-7388] [SPARK-7383] wrapper for VectorAssembler in Python · 9e2ffb13
      Burak Yavuz authored
       The wrapper required the implementation of `ArrayParam`, because `Array[T]` is hard to obtain from Python. `ArrayParam` has an extra function called `wCast`, which is an internal function to obtain `Array[T]` from `Seq[T]`.
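       A hedged usage sketch of the new wrapper (pyspark shell assumed):
       ```
       from pyspark.ml.feature import VectorAssembler

       df = sqlContext.createDataFrame([(1.0, 0.5, 3.0)], ["a", "b", "c"])
       assembler = VectorAssembler(inputCols=["a", "b"], outputCol="features")
       assembler.transform(df).select("features").show()
       ```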
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #5930 from brkyvz/ml-feat and squashes the following commits:
      
      73e745f [Burak Yavuz] Merge pull request #3 from mengxr/SPARK-7388
      c221db9 [Xiangrui Meng] overload StringArrayParam.w
      c81072d [Burak Yavuz] addressed comments
      99c2ebf [Burak Yavuz] add to python_shared_params
      39ecb07 [Burak Yavuz] fix scalastyle
      7f7ea2a [Burak Yavuz] [SPARK-7388][SPARK-7383] wrapper for VectorAssembler in Python
      9e2ffb13
    • Daoyuan Wang's avatar
      [SPARK-7330] [SQL] avoid NPE at jdbc rdd · ed9be06a
      Daoyuan Wang authored
       Thanks to nadavoosh for pointing this out in #5590.
      
      Author: Daoyuan Wang <daoyuan.wang@intel.com>
      
      Closes #5877 from adrian-wang/jdbcrdd and squashes the following commits:
      
      cc11900 [Daoyuan Wang] avoid NPE in jdbcrdd
      ed9be06a
    • Joseph K. Bradley's avatar
      [SPARK-7429] [ML] Params cleanups · 4f87e956
      Joseph K. Bradley authored
      Params.setDefault taking a set of ParamPairs should be annotated with varargs. I thought it would not work before, but it apparently does.
      
      CrossValidator.transform should call transformSchema since the underlying Model might be a PipelineModel
      
      CC: mengxr
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #5960 from jkbradley/params-cleanups and squashes the following commits:
      
      118b158 [Joseph K. Bradley] Params.setDefault taking a set of ParamPairs should be annotated with varargs. I thought it would not work before, but it apparently does. CrossValidator.transform should call transformSchema since the underlying Model might be a PipelineModel
      4f87e956
    • Joseph K. Bradley's avatar
      [SPARK-7421] [MLLIB] OnlineLDA cleanups · 8b6b46e4
      Joseph K. Bradley authored
      Small changes, primarily to allow us more flexibility in the future:
      * Rename "tau_0" to "tau0"
      * Mark LDAOptimizer trait sealed and DeveloperApi.
      * Mark LDAOptimizer subclasses as final.
      * Mark setOptimizer (the one taking an LDAOptimizer) and getOptimizer as DeveloperApi since we may need to change them in the future
      
      CC: hhbyyh
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #5956 from jkbradley/onlinelda-cleanups and squashes the following commits:
      
      f4be508 [Joseph K. Bradley] added newline
      f4003e4 [Joseph K. Bradley] Changes: * Rename "tau_0" to "tau0" * Mark LDAOptimizer trait sealed and DeveloperApi. * Mark LDAOptimizer subclasses as final. * Mark setOptimizer (the one taking an LDAOptimizer) and getOptimizer as DeveloperApi since we may need to change them in the future
      8b6b46e4
    • ksonj's avatar
      [SPARK-7035] Encourage __getitem__ over __getattr__ on column access in the Python DataFrame API · fae4e2d6
      ksonj authored
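       A hedged illustration of the guidance (pyspark shell assumed): attribute access silently loses to existing DataFrame attributes, while item access always resolves to the column.
       ```
       df = sqlContext.createDataFrame([(1, "Alice")], ["count", "name"])

       col = df["count"]   # __getitem__: unambiguously the column named "count"
       attr = df.count     # __getattr__: shadowed by the DataFrame.count() method here
       ```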
      Author: ksonj <kson@siberie.de>
      
      Closes #5971 from ksonj/doc and squashes the following commits:
      
      dadfebb [ksonj] __getitem__ is cleaner than __getattr__
      fae4e2d6
    • Shiti's avatar
      [SPARK-7295][SQL] bitwise operations for DataFrame DSL · fa8fddff
      Shiti authored
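       A hedged sketch using the PySpark mirror of these Column operations (method names assumed to be `bitwiseAND` / `bitwiseOR` / `bitwiseXOR`, plus `bitwiseNOT` in `functions`, per the squashed commit; pyspark shell assumed):
       ```
       from pyspark.sql.functions import bitwiseNOT

       df = sqlContext.createDataFrame([(0b1010, 0b0110)], ["x", "y"])
       df.select(df["x"].bitwiseAND(df["y"]),
                 df["x"].bitwiseOR(df["y"]),
                 df["x"].bitwiseXOR(df["y"]),
                 bitwiseNOT(df["x"])).show()
       ```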
      Author: Shiti <ssaxena.ece@gmail.com>
      
      Closes #5867 from Shiti/spark-7295 and squashes the following commits:
      
      71a9913 [Shiti] implementation for bitwise and,or, not and xor on Column with tests and docs
      fa8fddff
    • Tathagata Das's avatar
      [SPARK-7217] [STREAMING] Add configuration to control the default behavior of... · 01187f59
      Tathagata Das authored
      [SPARK-7217] [STREAMING] Add configuration to control the default behavior of StreamingContext.stop() implicitly calling SparkContext.stop()
      
      In environments like notebooks, the SparkContext is managed by the underlying infrastructure and it is expected that the SparkContext will not be stopped. However, StreamingContext.stop() calls SparkContext.stop() as a non-intuitive side-effect. This PR adds a configuration in SparkConf that sets the default StreamingContext stop behavior. It should be such that the existing behavior does not change for existing users.
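       A hedged sketch of the notebook scenario (the exact configuration key added by this PR isn't named in the description, so the explicit `stop()` argument is shown instead):
       ```
       from pyspark import SparkConf, SparkContext
       from pyspark.streaming import StreamingContext

       sc = SparkContext(conf=SparkConf().setAppName("notebook"))
       ssc = StreamingContext(sc, batchDuration=1)
       # ... set up DStreams and run the streaming job ...

       # Stop only the StreamingContext; the surrounding SparkContext stays usable.
       ssc.stop(stopSparkContext=False)
       ```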
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #5929 from tdas/SPARK-7217 and squashes the following commits:
      
      869a763 [Tathagata Das] Changed implementation.
      685fe00 [Tathagata Das] Added configuration
      01187f59
    • Tathagata Das's avatar
      [SPARK-7430] [STREAMING] [TEST] General improvements to streaming tests to increase debuggability · cfdadcbd
      Tathagata Das authored
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #5961 from tdas/SPARK-7430 and squashes the following commits:
      
      d654978 [Tathagata Das] Fix scala style
      fbf7174 [Tathagata Das] Added more verbose assert failure messages.
      6aea07a [Tathagata Das] Ensure SynchronizedBuffer is used in every TestSuiteBase
      cfdadcbd
    • Nathan Howell's avatar
      [SPARK-5938] [SPARK-5443] [SQL] Improve JsonRDD performance · 2d6612cc
      Nathan Howell authored
      This patch comprises of a few related pieces of work:
      
      * Schema inference is performed directly on the JSON token stream
      * `String => Row` conversion populate Spark SQL structures without intermediate types
      * Projection pushdown is implemented via CatalystScan for DataFrame queries
      * Support for the legacy parser by setting `spark.sql.json.useJacksonStreamingAPI` to `false`
      
      Performance improvements depend on the schema and queries being executed, but it should be faster across the board. Below are benchmarks using the last.fm Million Song dataset:
      
      ```
      Command                                            | Baseline | Patched
      ---------------------------------------------------|----------|--------
      import sqlContext.implicits._                      |          |
      val df = sqlContext.jsonFile("/tmp/lastfm.json")   |    70.0s |   14.6s
      df.count()                                         |    28.8s |    6.2s
      df.rdd.count()                                     |    35.3s |   21.5s
      df.where($"artist" === "Robert Hood").collect()    |    28.3s |   16.9s
      ```
      
      To prepare this dataset for benchmarking, follow these steps:
      
      ```
      # Fetch the datasets from http://labrosa.ee.columbia.edu/millionsong/lastfm
      wget http://labrosa.ee.columbia.edu/millionsong/sites/default/files/lastfm/lastfm_test.zip \
           http://labrosa.ee.columbia.edu/millionsong/sites/default/files/lastfm/lastfm_train.zip
      
      # Decompress and combine, pipe through `jq -c` to ensure there is one record per line
      unzip -p lastfm_test.zip lastfm_train.zip  | jq -c . > lastfm.json
      ```
      
      Author: Nathan Howell <nhowell@godaddy.com>
      
      Closes #5801 from NathanHowell/json-performance and squashes the following commits:
      
      26fea31 [Nathan Howell] Recreate the baseRDD each for each scan operation
      a7ebeb2 [Nathan Howell] Increase coverage of inserts into a JSONRelation
      e06a1dd [Nathan Howell] Add comments to the `useJacksonStreamingAPI` config flag
      6822712 [Nathan Howell] Split up JsonRDD2 into multiple objects
      fa8234f [Nathan Howell] Wrap long lines
      b31917b [Nathan Howell] Rename `useJsonRDD2` to `useJacksonStreamingAPI`
      15c5d1b [Nathan Howell] JSONRelation's baseRDD need not be lazy
      f8add6e [Nathan Howell] Add comments on lack of support for precision and scale DecimalTypes
      fa0be47 [Nathan Howell] Remove unused default case in the field parser
      80dba17 [Nathan Howell] Add comments regarding null handling and empty strings
      842846d [Nathan Howell] Point the empty schema inference test at JsonRDD2
      ab6ee87 [Nathan Howell] Add projection pushdown support to JsonRDD/JsonRDD2
      f636c14 [Nathan Howell] Enable JsonRDD2 by default, add a flag to switch back to JsonRDD
      0bbc445 [Nathan Howell] Improve JSON parsing and type inference performance
      7ca70c1 [Nathan Howell] Eliminate arrow pattern, replace with pattern matches
      2d6612cc
    • Sun Rui's avatar
      [SPARK-6812] [SPARKR] filter() on DataFrame does not work as expected. · 9cfa9a51
      Sun Rui authored
      According to the R manual: https://stat.ethz.ch/R-manual/R-devel/library/base/html/Startup.html,
      " if a function .First is found on the search path, it is executed as .First(). Finally, function .First.sys() in the base package is run. This calls require to attach the default packages specified by options("defaultPackages")."
       In .First() in profile/shell.R, we load the SparkR package. This means the SparkR package is loaded before the default packages. If the default packages contain functions with the same names, they mask those in SparkR. This is why filter() in SparkR is masked by filter() in stats, which is usually in the default package list.
       We need to make sure SparkR is loaded after the default packages. The solution is to append SparkR to the default packages, instead of loading SparkR in .First().
      
       BTW, I'd like to discuss our policy on how to resolve name conflicts. Previously, we renamed APIs from the Scala API if there was a name conflict with base or other commonly used packages. However, from a long-term perspective this is not good for API stability, because we can't predict name conflicts; for example, what if a name added to the base package in the future conflicts with an API in SparkR? So the better policy is to keep API names the same as Scala's without worrying about name conflicts. When users use SparkR, they should load SparkR as the last package, so that all its API names are effective. Users can explicitly use :: to refer to names hidden by other packages. If we agree on this, I can submit a JIRA issue to change back some renamed API methods, for example, DataFrame.sortDF().
      
      Author: Sun Rui <rui.sun@intel.com>
      
      Closes #5938 from sun-rui/SPARK-6812 and squashes the following commits:
      
      b569145 [Sun Rui] [SPARK-6812][SparkR] filter() on DataFrame does not work as expected.
      9cfa9a51
    • Xiangrui Meng's avatar
      [SPARK-7432] [MLLIB] disable cv doctest · 773aa252
      Xiangrui Meng authored
      Temporarily disable flaky doctest for CrossValidator. jkbradley
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #5962 from mengxr/disable-pyspark-cv-test and squashes the following commits:
      
      5db7e5b [Xiangrui Meng] disable cv doctest
      773aa252