Skip to content
Snippets Groups Projects
  1. May 07, 2015
    • Yanbo Liang's avatar
      [SPARK-6093] [MLLIB] Add RegressionMetrics in PySpark/MLlib · 1712a7c7
      Yanbo Liang authored
      https://issues.apache.org/jira/browse/SPARK-6093
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #5941 from yanboliang/spark-6093 and squashes the following commits:
      
      6934af3 [Yanbo Liang] change to @property
      aac3bc5 [Yanbo Liang] Add RegressionMetrics in PySpark/MLlib
      1712a7c7
    • Olivier Girardot's avatar
      [SPARK-7118] [Python] Add the coalesce Spark SQL function available in PySpark · 068c3158
      Olivier Girardot authored
      This patch adds a proxy call from PySpark to the Spark SQL coalesce function and this patch comes out of a discussion on devspark with rxin
      
      This contribution is my original work and i license the work to the project under the project's open source license.
      
      Olivier.
      
      Author: Olivier Girardot <o.girardot@lateral-thoughts.com>
      
      Closes #5698 from ogirardot/master and squashes the following commits:
      
      d9a4439 [Olivier Girardot] SPARK-7118 Add the coalesce Spark SQL function available in PySpark
      068c3158
    • Burak Yavuz's avatar
      [SPARK-7388] [SPARK-7383] wrapper for VectorAssembler in Python · 9e2ffb13
      Burak Yavuz authored
      The wrapper required the implementation of the `ArrayParam`, because `Array[T]` is hard to obtain from Python. `ArrayParam` has an extra function called `wCast` which is an internal function to obtain `Array[T]` from `Seq[T]`
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #5930 from brkyvz/ml-feat and squashes the following commits:
      
      73e745f [Burak Yavuz] Merge pull request #3 from mengxr/SPARK-7388
      c221db9 [Xiangrui Meng] overload StringArrayParam.w
      c81072d [Burak Yavuz] addressed comments
      99c2ebf [Burak Yavuz] add to python_shared_params
      39ecb07 [Burak Yavuz] fix scalastyle
      7f7ea2a [Burak Yavuz] [SPARK-7388][SPARK-7383] wrapper for VectorAssembler in Python
      9e2ffb13
    • Daoyuan Wang's avatar
      [SPARK-7330] [SQL] avoid NPE at jdbc rdd · ed9be06a
      Daoyuan Wang authored
      Thank nadavoosh point this out in #5590
      
      Author: Daoyuan Wang <daoyuan.wang@intel.com>
      
      Closes #5877 from adrian-wang/jdbcrdd and squashes the following commits:
      
      cc11900 [Daoyuan Wang] avoid NPE in jdbcrdd
      ed9be06a
    • Joseph K. Bradley's avatar
      [SPARK-7429] [ML] Params cleanups · 4f87e956
      Joseph K. Bradley authored
      Params.setDefault taking a set of ParamPairs should be annotated with varargs. I thought it would not work before, but it apparently does.
      
      CrossValidator.transform should call transformSchema since the underlying Model might be a PipelineModel
      
      CC: mengxr
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #5960 from jkbradley/params-cleanups and squashes the following commits:
      
      118b158 [Joseph K. Bradley] Params.setDefault taking a set of ParamPairs should be annotated with varargs. I thought it would not work before, but it apparently does. CrossValidator.transform should call transformSchema since the underlying Model might be a PipelineModel
      4f87e956
    • Joseph K. Bradley's avatar
      [SPARK-7421] [MLLIB] OnlineLDA cleanups · 8b6b46e4
      Joseph K. Bradley authored
      Small changes, primarily to allow us more flexibility in the future:
      * Rename "tau_0" to "tau0"
      * Mark LDAOptimizer trait sealed and DeveloperApi.
      * Mark LDAOptimizer subclasses as final.
      * Mark setOptimizer (the one taking an LDAOptimizer) and getOptimizer as DeveloperApi since we may need to change them in the future
      
      CC: hhbyyh
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #5956 from jkbradley/onlinelda-cleanups and squashes the following commits:
      
      f4be508 [Joseph K. Bradley] added newline
      f4003e4 [Joseph K. Bradley] Changes: * Rename "tau_0" to "tau0" * Mark LDAOptimizer trait sealed and DeveloperApi. * Mark LDAOptimizer subclasses as final. * Mark setOptimizer (the one taking an LDAOptimizer) and getOptimizer as DeveloperApi since we may need to change them in the future
      8b6b46e4
    • ksonj's avatar
      [SPARK-7035] Encourage __getitem__ over __getattr__ on column access in the Python DataFrame API · fae4e2d6
      ksonj authored
      Author: ksonj <kson@siberie.de>
      
      Closes #5971 from ksonj/doc and squashes the following commits:
      
      dadfebb [ksonj] __getitem__ is cleaner than __getattr__
      fae4e2d6
    • Shiti's avatar
      [SPARK-7295][SQL] bitwise operations for DataFrame DSL · fa8fddff
      Shiti authored
      Author: Shiti <ssaxena.ece@gmail.com>
      
      Closes #5867 from Shiti/spark-7295 and squashes the following commits:
      
      71a9913 [Shiti] implementation for bitwise and,or, not and xor on Column with tests and docs
      fa8fddff
    • Tathagata Das's avatar
      [SPARK-7217] [STREAMING] Add configuration to control the default behavior of... · 01187f59
      Tathagata Das authored
      [SPARK-7217] [STREAMING] Add configuration to control the default behavior of StreamingContext.stop() implicitly calling SparkContext.stop()
      
      In environments like notebooks, the SparkContext is managed by the underlying infrastructure and it is expected that the SparkContext will not be stopped. However, StreamingContext.stop() calls SparkContext.stop() as a non-intuitive side-effect. This PR adds a configuration in SparkConf that sets the default StreamingContext stop behavior. It should be such that the existing behavior does not change for existing users.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #5929 from tdas/SPARK-7217 and squashes the following commits:
      
      869a763 [Tathagata Das] Changed implementation.
      685fe00 [Tathagata Das] Added configuration
      01187f59
    • Tathagata Das's avatar
      [SPARK-7430] [STREAMING] [TEST] General improvements to streaming tests to increase debuggability · cfdadcbd
      Tathagata Das authored
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #5961 from tdas/SPARK-7430 and squashes the following commits:
      
      d654978 [Tathagata Das] Fix scala style
      fbf7174 [Tathagata Das] Added more verbose assert failure messages.
      6aea07a [Tathagata Das] Ensure SynchronizedBuffer is used in every TestSuiteBase
      cfdadcbd
    • Nathan Howell's avatar
      [SPARK-5938] [SPARK-5443] [SQL] Improve JsonRDD performance · 2d6612cc
      Nathan Howell authored
      This patch comprises of a few related pieces of work:
      
      * Schema inference is performed directly on the JSON token stream
      * `String => Row` conversion populate Spark SQL structures without intermediate types
      * Projection pushdown is implemented via CatalystScan for DataFrame queries
      * Support for the legacy parser by setting `spark.sql.json.useJacksonStreamingAPI` to `false`
      
      Performance improvements depend on the schema and queries being executed, but it should be faster across the board. Below are benchmarks using the last.fm Million Song dataset:
      
      ```
      Command                                            | Baseline | Patched
      ---------------------------------------------------|----------|--------
      import sqlContext.implicits._                      |          |
      val df = sqlContext.jsonFile("/tmp/lastfm.json")   |    70.0s |   14.6s
      df.count()                                         |    28.8s |    6.2s
      df.rdd.count()                                     |    35.3s |   21.5s
      df.where($"artist" === "Robert Hood").collect()    |    28.3s |   16.9s
      ```
      
      To prepare this dataset for benchmarking, follow these steps:
      
      ```
      # Fetch the datasets from http://labrosa.ee.columbia.edu/millionsong/lastfm
      wget http://labrosa.ee.columbia.edu/millionsong/sites/default/files/lastfm/lastfm_test.zip \
           http://labrosa.ee.columbia.edu/millionsong/sites/default/files/lastfm/lastfm_train.zip
      
      # Decompress and combine, pipe through `jq -c` to ensure there is one record per line
      unzip -p lastfm_test.zip lastfm_train.zip  | jq -c . > lastfm.json
      ```
      
      Author: Nathan Howell <nhowell@godaddy.com>
      
      Closes #5801 from NathanHowell/json-performance and squashes the following commits:
      
      26fea31 [Nathan Howell] Recreate the baseRDD each for each scan operation
      a7ebeb2 [Nathan Howell] Increase coverage of inserts into a JSONRelation
      e06a1dd [Nathan Howell] Add comments to the `useJacksonStreamingAPI` config flag
      6822712 [Nathan Howell] Split up JsonRDD2 into multiple objects
      fa8234f [Nathan Howell] Wrap long lines
      b31917b [Nathan Howell] Rename `useJsonRDD2` to `useJacksonStreamingAPI`
      15c5d1b [Nathan Howell] JSONRelation's baseRDD need not be lazy
      f8add6e [Nathan Howell] Add comments on lack of support for precision and scale DecimalTypes
      fa0be47 [Nathan Howell] Remove unused default case in the field parser
      80dba17 [Nathan Howell] Add comments regarding null handling and empty strings
      842846d [Nathan Howell] Point the empty schema inference test at JsonRDD2
      ab6ee87 [Nathan Howell] Add projection pushdown support to JsonRDD/JsonRDD2
      f636c14 [Nathan Howell] Enable JsonRDD2 by default, add a flag to switch back to JsonRDD
      0bbc445 [Nathan Howell] Improve JSON parsing and type inference performance
      7ca70c1 [Nathan Howell] Eliminate arrow pattern, replace with pattern matches
      2d6612cc
    • Sun Rui's avatar
      [SPARK-6812] [SPARKR] filter() on DataFrame does not work as expected. · 9cfa9a51
      Sun Rui authored
      According to the R manual: https://stat.ethz.ch/R-manual/R-devel/library/base/html/Startup.html,
      " if a function .First is found on the search path, it is executed as .First(). Finally, function .First.sys() in the base package is run. This calls require to attach the default packages specified by options("defaultPackages")."
      In .First() in profile/shell.R, we load SparkR package. This means SparkR package is loaded before default packages. If there are same names in default packages, they will overwrite those in SparkR. This is why filter() in SparkR is masked by filter() in stats, which is usually in the default package list.
      We need to make sure SparkR is loaded after default packages. The solution is to append SparkR to default packages, instead of loading SparkR in .First().
      
      BTW, I'd like to discuss our policy on how to solve name conflict. Previously, we rename API names from Scala API if there is name conflict with base or other commonly-used packages. However, from long term perspective, this is not good for API stability, because we can't predict name conflicts, for example, if in the future a name added in base package conflicts with an API in SparkR? So the better policy is to keep API name same as Scala's without worrying about name conflicts. When users use SparkR, they should load SparkR as last package, so that all API names are effective. Use can explicitly use :: to refer to hidden names from other packages. If we agree on this, I can submit a JIRA issue to change back some rename API methods, for example, DataFrame.sortDF().
      
      Author: Sun Rui <rui.sun@intel.com>
      
      Closes #5938 from sun-rui/SPARK-6812 and squashes the following commits:
      
      b569145 [Sun Rui] [SPARK-6812][SparkR] filter() on DataFrame does not work as expected.
      9cfa9a51
    • Xiangrui Meng's avatar
      [SPARK-7432] [MLLIB] disable cv doctest · 773aa252
      Xiangrui Meng authored
      Temporarily disable flaky doctest for CrossValidator. jkbradley
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #5962 from mengxr/disable-pyspark-cv-test and squashes the following commits:
      
      5db7e5b [Xiangrui Meng] disable cv doctest
      773aa252
  2. May 06, 2015
    • zsxwing's avatar
      [SPARK-7405] [STREAMING] Fix the bug that ReceiverInputDStream doesn't report InputInfo · 14502d5e
      zsxwing authored
      The bug is because SPARK-7139 removed some codes from SPARK-7112 unintentionally here: https://github.com/apache/spark/commit/1854ac326a9cc6014817d8df30ed0458eee5d7d1#diff-5c8651dd78abd20439b8eb938175075dL72
      
      This PR just added them back and added some assertions in the tests to verify it.
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #5950 from zsxwing/SPARK-7405 and squashes the following commits:
      
      675f5d9 [zsxwing] Fix the bug that ReceiverInputDStream doesn't report InputInfo
      14502d5e
    • Andrew Or's avatar
      [HOT FIX] For DAG visualization #5954 · 71a452b6
      Andrew Or authored
      71a452b6
    • Andrew Or's avatar
      [SPARK-7371] [SPARK-7377] [SPARK-7408] DAG visualization addendum (#5729) · 8fa6829f
      Andrew Or authored
      This is a follow-up patch for #5729.
      
      **[SPARK-7408]** Move as much style code from JS to CSS as possible
      **[SPARK-7377]** Fix JS error if a job / stage contains only one RDD
      **[SPARK-7371]** Decrease emphasis on RDD on stage page as requested by mateiz pwendell
      
      This patch also includes general code clean up.
      
      <img src="https://issues.apache.org/jira/secure/attachment/12730992/before-after.png" width="500px"></img>
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #5954 from andrewor14/viz-emphasize-rdd and squashes the following commits:
      
      3c0d4f0 [Andrew Or] Guard against JS error by rendering arrows only if needed
      f23e15b [Andrew Or] Merge branch 'master' of github.com:apache/spark into viz-emphasize-rdd
      565801f [Andrew Or] Clean up code
      9dab5f0 [Andrew Or] Move styling from JS to CSS + clean up code
      107c0b6 [Andrew Or] Tweak background color, stroke width, font size etc.
      1610c62 [Andrew Or] Implement cluster padding for stage page
      8fa6829f
    • jerryshao's avatar
      [SPARK-7396] [STREAMING] [EXAMPLE] Update KafkaWordCountProducer to use new Producer API · 316a5c04
      jerryshao authored
      Otherwise it will throw exception:
      
      ```
      Exception in thread "main" kafka.common.FailedToSendMessageException: Failed to send messages after 3 tries.
      	at kafka.producer.async.DefaultEventHandler.handle(DefaultEventHandler.scala:90)
      	at kafka.producer.Producer.send(Producer.scala:77)
      	at org.apache.spark.examples.streaming.KafkaWordCountProducer$.main(KafkaWordCount.scala:96)
      	at org.apache.spark.examples.streaming.KafkaWordCountProducer.main(KafkaWordCount.scala)
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:606)
      	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:623)
      	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169)
      	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192)
      	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111)
      	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
      ```
      
      Author: jerryshao <saisai.shao@intel.com>
      
      Closes #5936 from jerryshao/SPARK-7396 and squashes the following commits:
      
      270bbe2 [jerryshao] Fix Kafka Produce throw Exception issue
      316a5c04
    • Shivaram Venkataraman's avatar
      [SPARK-6799] [SPARKR] Remove SparkR RDD examples, add dataframe examples · 4e930420
      Shivaram Venkataraman authored
      This PR also makes some of the DataFrame to RDD methods private as the RDD class is private in 1.4
      
      cc rxin pwendell
      
      Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
      
      Closes #5949 from shivaram/sparkr-examples and squashes the following commits:
      
      6c42fdc [Shivaram Venkataraman] Remove SparkR RDD examples, add dataframe examples
      4e930420
    • Andrew Or's avatar
    • Joseph K. Bradley's avatar
      [SPARK-5995] [ML] Make Prediction dev API public · 1ad04dae
      Joseph K. Bradley authored
      Changes:
      * Update protected prediction methods, following design doc. **<--most interesting change**
      * Changed abstract classes for Estimator and Model to be public.  Added DeveloperApi tag.  (I kept the traits for Estimator/Model Params private.)
      * Changed ProbabilisticClassificationModel method names to use probability instead of probabilities.
      
      CC: mengxr shivaram etrain
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #5913 from jkbradley/public-dev-api and squashes the following commits:
      
      e9aa0ea [Joseph K. Bradley] moved findMax to DenseVector and renamed to argmax. fixed bug for vector of length 0
      15b9957 [Joseph K. Bradley] renamed probabilities to probability in method names
      5cda84d [Joseph K. Bradley] regenerated sharedParams
      7d1877a [Joseph K. Bradley] Made spark.ml prediction abstractions public.  Organized their prediction methods for efficient computation of multiple output columns.
      1ad04dae
    • Yin Huai's avatar
      [HOT-FIX] Move HiveWindowFunctionQuerySuite.scala to hive compatibility dir. · 77409967
      Yin Huai authored
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #5951 from yhuai/fixBuildMaven and squashes the following commits:
      
      fdde183 [Yin Huai] Move HiveWindowFunctionQuerySuite.scala to hive compatibility dir.
      77409967
    • Josh Rosen's avatar
      Add `Private` annotation. · 845d1d4d
      Josh Rosen authored
      This was originally added as part of #4435, which was reverted.
      845d1d4d
    • Josh Rosen's avatar
      [SPARK-7311] Introduce internal Serializer API for determining if serializers... · 002c1238
      Josh Rosen authored
      [SPARK-7311] Introduce internal Serializer API for determining if serializers support object relocation
      
      This patch extends the `Serializer` interface with a new `Private` API which allows serializers to indicate whether they support relocation of serialized objects in serializer stream output.
      
      This relocatibilty property is described in more detail in `Serializer.scala`, but in a nutshell a serializer supports relocation if reordering the bytes of serialized objects in serialization stream output is equivalent to having re-ordered those elements prior to serializing them.  The optimized shuffle path introduced in #4450 and #5868 both rely on serializers having this property; this patch just centralizes the logic for determining whether a serializer has this property.  I also added tests and comments clarifying when this works for KryoSerializer.
      
      This change allows the optimizations in #4450 to be applied for shuffles that use `SqlSerializer2`.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #5924 from JoshRosen/SPARK-7311 and squashes the following commits:
      
      50a68ca [Josh Rosen] Address minor nits
      0a7ebd7 [Josh Rosen] Clarify reason why SqlSerializer2 supports this serializer
      123b992 [Josh Rosen] Cleanup for submitting as standalone patch.
      4aa61b2 [Josh Rosen] Add missing newline
      2c1233a [Josh Rosen] Small refactoring of SerializerPropertiesSuite to enable test re-use:
      0ba75e6 [Josh Rosen] Add tests for serializer relocation property.
      450fa21 [Josh Rosen] Back out accidental log4j.properties change
      86d4dcd [Josh Rosen] Flag that SparkSqlSerializer2 supports relocation
      b9624ee [Josh Rosen] Expand serializer API and use new function to help control when new UnsafeShuffle path is used.
      002c1238
    • Yin Huai's avatar
      [SPARK-1442] [SQL] Window Function Support for Spark SQL · f2c47082
      Yin Huai authored
      Adding more information about the implementation...
      
      This PR is adding the support of window functions to Spark SQL (specifically OVER and WINDOW clause). For every expression having a OVER clause, we use a WindowExpression as the container of a WindowFunction and the corresponding WindowSpecDefinition (the definition of a window frame, i.e. partition specification, order specification, and frame specification appearing in a OVER clause).
      # Implementation #
      The high level work flow of the implementation is described as follows.
      
      *	Query parsing: In the query parse process, all WindowExpressions are originally placed in the projectList of a Project operator or the aggregateExpressions of an Aggregate operator. It makes our changes to simple and keep all of parsing rules for window functions at a single place (nodesToWindowSpecification). For the WINDOWclause in a query, we use a WithWindowDefinition as the container as the mapping from the name of a window specification to a WindowSpecDefinition. This changes is similar with our common table expression support.
      
      *	Analysis: The query analysis process has three steps for window functions.
      
       *	Resolve all WindowSpecReferences by replacing them with WindowSpecReferences according to the mapping table stored in the node of WithWindowDefinition.
       *	Resolve WindowFunctions in the projectList of a Project operator or the aggregateExpressions of an Aggregate operator. For this PR, we use Hive's functions for window functions because we will have a major refactoring of our internal UDAFs and it is better to switch our UDAFs after that refactoring work.
       *	Once we have resolved all WindowFunctions, we will use ResolveWindowFunction to extract WindowExpressions from projectList and aggregateExpressions and then create a Window operator for every distinct WindowSpecDefinition. With this choice, at the execution time, we can rely on the Exchange operator to do all of work on reorganizing the table and we do not need to worry about it in the physical Window operator. An example analyzed plan is shown as follows
      
      ```
      sql("""
      SELECT
        year, country, product, sales,
        avg(sales) over(partition by product) avg_product,
        sum(sales) over(partition by country) sum_country
      FROM sales
      ORDER BY year, country, product
      """).explain(true)
      
      == Analyzed Logical Plan ==
      Sort [year#34 ASC,country#35 ASC,product#36 ASC], true
       Project [year#34,country#35,product#36,sales#37,avg_product#27,sum_country#28]
        Window [year#34,country#35,product#36,sales#37,avg_product#27], [HiveWindowFunction#org.apache.hadoop.hive.ql.udf.generic.GenericUDAFSum(sales#37) WindowSpecDefinition [country#35], [], ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING AS sum_country#28], WindowSpecDefinition [country#35], [], ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
         Window [year#34,country#35,product#36,sales#37], [HiveWindowFunction#org.apache.hadoop.hive.ql.udf.generic.GenericUDAFAverage(sales#37) WindowSpecDefinition [product#36], [], ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING AS avg_product#27], WindowSpecDefinition [product#36], [], ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
          Project [year#34,country#35,product#36,sales#37]
           MetastoreRelation default, sales, None
      ```
      
      *	Query planning: In the process of query planning, we simple generate the physical Window operator based on the logical Window operator. Then, to prepare the executedPlan, the EnsureRequirements rule will add Exchange and Sort operators if necessary. The EnsureRequirements rule will analyze the data properties and try to not add unnecessary shuffle and sort. The physical plan for the above example query is shown below.
      
      ```
      == Physical Plan ==
      Sort [year#34 ASC,country#35 ASC,product#36 ASC], true
       Exchange (RangePartitioning [year#34 ASC,country#35 ASC,product#36 ASC], 200), []
        Window [year#34,country#35,product#36,sales#37,avg_product#27], [HiveWindowFunction#org.apache.hadoop.hive.ql.udf.generic.GenericUDAFSum(sales#37) WindowSpecDefinition [country#35], [], ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING AS sum_country#28], WindowSpecDefinition [country#35], [], ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
         Exchange (HashPartitioning [country#35], 200), [country#35 ASC]
          Window [year#34,country#35,product#36,sales#37], [HiveWindowFunction#org.apache.hadoop.hive.ql.udf.generic.GenericUDAFAverage(sales#37) WindowSpecDefinition [product#36], [], ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING AS avg_product#27], WindowSpecDefinition [product#36], [], ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
           Exchange (HashPartitioning [product#36], 200), [product#36 ASC]
            HiveTableScan [year#34,country#35,product#36,sales#37], (MetastoreRelation default, sales, None), None
      ```
      
      *	Execution time: At execution time, a physical Window operator buffers all rows in a partition specified in the partition spec of a OVER clause. If necessary, it also maintains a sliding window frame. The current implementation tries to buffer the input parameters of a window function according to the window frame to avoid evaluating a row multiple times.
      
      # Future work #
      
      Here are three improvements that are not hard to add:
      *	Taking advantage of the window frame specification to reduce the number of rows buffered in the physical Window operator. For some cases, we only need to buffer the rows appearing in the sliding window. But for other cases, we will not be able to reduce the number of rows buffered (e.g. ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING).
      
      *	When aRAGEN frame is used, for <value> PRECEDING and <value> FOLLOWING, it will be great if the <value> part is an expression (we can start with Literal). So, when the data type of ORDER BY expression is a FractionalType, we can support FractionalType as the type <value> (<value> still needs to be evaluated as a positive value).
      
      *	When aRAGEN frame is used, we need to support DateType and TimestampType as the data type of the expression appearing in the order specification. Then, the <value> part of <value> PRECEDING and <value> FOLLOWING can support interval types (once we support them).
      
      This is a joint work with guowei2 and yhuai
      Thanks hbutani hvanhovell for his comments
      Thanks scwf for his comments and unit tests
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #5604 from guowei2/windowImplement and squashes the following commits:
      
      76fe1c8 [Yin Huai] Implementation.
      aa2b0ae [Yin Huai] Tests.
      f2c47082
    • Daoyuan Wang's avatar
      [SPARK-6201] [SQL] promote string and do widen types for IN · c3eb441f
      Daoyuan Wang authored
      huangjs
      Acutally spark sql will first go through analysis period, in which we do widen types and promote strings, and then optimization, where constant IN will be converted into INSET.
      
      So it turn out that we only need to fix this for IN.
      
      Author: Daoyuan Wang <daoyuan.wang@intel.com>
      
      Closes #4945 from adrian-wang/inset and squashes the following commits:
      
      71e05cc [Daoyuan Wang] minor fix
      581fa1c [Daoyuan Wang] mysql way
      f3f7baf [Daoyuan Wang] address comments
      5eed4bc [Daoyuan Wang] promote string and do widen types for IN
      c3eb441f
    • Daoyuan Wang's avatar
      [SPARK-5456] [SQL] fix decimal compare for jdbc rdd · 150f671c
      Daoyuan Wang authored
      Author: Daoyuan Wang <daoyuan.wang@intel.com>
      
      Closes #5803 from adrian-wang/decimalcompare and squashes the following commits:
      
      aef0e96 [Daoyuan Wang] add null handle
      ec455b9 [Daoyuan Wang] fix decimal compare for jdbc rdd
      150f671c
    • Reynold Xin's avatar
      [SQL] JavaDoc update for various DataFrame functions. · 322e7e7f
      Reynold Xin authored
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #5935 from rxin/df-doc1 and squashes the following commits:
      
      aaeaadb [Reynold Xin] [SQL] JavaDoc update for various DataFrame functions.
      322e7e7f
    • Xiangrui Meng's avatar
      [SPARK-6940] [MLLIB] Add CrossValidator to Python ML pipeline API · 32cdc815
      Xiangrui Meng authored
      Since CrossValidator is a meta algorithm, we copy the implementation in Python. jkbradley
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #5926 from mengxr/SPARK-6940 and squashes the following commits:
      
      6af181f [Xiangrui Meng] add TODOs
      8285134 [Xiangrui Meng] update doc
      060f7c3 [Xiangrui Meng] update doctest
      acac727 [Xiangrui Meng] add keyword args
      cdddecd [Xiangrui Meng] add CrossValidator in Python
      32cdc815
    • zsxwing's avatar
      [SPARK-7384][Core][Tests] Fix flaky tests for distributed mode in BroadcastSuite · 9f019c72
      zsxwing authored
      Fixed the following failure: https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.3-Maven-pre-YARN/hadoop.version=1.0.4,label=centos/452/testReport/junit/org.apache.spark.broadcast/BroadcastSuite/Unpersisting_HttpBroadcast_on_executors_and_driver_in_distributed_mode/
      
      The tests should wait until all slaves are up. Otherwise, there may be only a part of `BlockManager`s registered, and fail the tests.
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #5925 from zsxwing/SPARK-7384 and squashes the following commits:
      
      783cb7b [zsxwing] Add comments for _jobProgressListener and remove postfixOps
      1009ef1 [zsxwing] [SPARK-7384][Core][Tests] Fix flaky tests for distributed mode in BroadcastSuite
      9f019c72
    • Yanbo Liang's avatar
      [SPARK-6267] [MLLIB] Python API for IsotonicRegression · 7b145783
      Yanbo Liang authored
      https://issues.apache.org/jira/browse/SPARK-6267
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #5890 from yanboliang/spark-6267 and squashes the following commits:
      
      f20541d [Yanbo Liang] Merge pull request #3 from mengxr/SPARK-6267
      7f202f9 [Xiangrui Meng] use Vector to have the best Python 2&3 compatibility
      4bccfee [Yanbo Liang] fix doctest
      ec09412 [Yanbo Liang] fix typos
      8214bbb [Yanbo Liang] fix code style
      5c8ebe5 [Yanbo Liang] Python API for IsotonicRegression
      7b145783
    • Burak Yavuz's avatar
      [SPARK-7358][SQL] Move DataFrame mathfunctions into functions · ba2b5661
      Burak Yavuz authored
      After a discussion on the user mailing list, it was decided to put all UDF's under `o.a.s.sql.functions`
      
      cc rxin
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #5923 from brkyvz/move-math-funcs and squashes the following commits:
      
      a8dc3f7 [Burak Yavuz] address comments
      cf7a7bb [Burak Yavuz] [SPARK-7358] Move DataFrame mathfunctions into functions
      ba2b5661
  3. May 05, 2015
    • qhuang's avatar
      [SPARK-6841] [SPARKR] add support for mean, median, stdev etc. · a4669443
      qhuang authored
      Moving here from https://github.com/amplab-extras/SparkR-pkg/pull/241
      sum() has been implemented. (https://github.com/amplab-extras/SparkR-pkg/pull/242)
      
      Now Phase 1: mean, sd, var have been implemented, but some things still need to be improved with the suggestions in https://issues.apache.org/jira/browse/SPARK-6841
      
      Author: qhuang <qian.huang@intel.com>
      
      Closes #5446 from hqzizania/R and squashes the following commits:
      
      f283572 [qhuang] add test unit for describe()
      2e74d5a [qhuang] add describe() DataFrame API
      a4669443
    • Reynold Xin's avatar
      Revert "[SPARK-3454] separate json endpoints for data in the UI" · 51b3d41e
      Reynold Xin authored
      This reverts commit d4973580.
      
      The commit broke Spark on Windows.
      51b3d41e
    • Reynold Xin's avatar
      [SPARK-6231][SQL/DF] Automatically resolve join condition ambiguity for self-joins. · 1fd31ba0
      Reynold Xin authored
      See the comment in join function for more information.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #5919 from rxin/self-join-resolve and squashes the following commits:
      
      e2fb0da [Reynold Xin] Updated SQLConf comment.
      7233a86 [Reynold Xin] Updated comment.
      6be2b4d [Reynold Xin] Removed println
      9f6b72f [Reynold Xin] [SPARK-6231][SQL/DF] Automatically resolve ambiguity in join condition for self-joins.
      1fd31ba0
    • Sandy Ryza's avatar
      Some minor cleanup after SPARK-4550. · 0092abb4
      Sandy Ryza authored
      JoshRosen this PR addresses the comments you left on #4450 after it got merged.
      
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #5916 from sryza/sandy-spark-4550-cleanup and squashes the following commits:
      
      dee3d85 [Sandy Ryza] Some minor cleanup after SPARK-4550.
      0092abb4
    • Shivaram Venkataraman's avatar
      [SPARK-7230] [SPARKR] Make RDD private in SparkR. · c688e3c5
      Shivaram Venkataraman authored
      This change makes the RDD API private in SparkR and all internal uses of the SparkR API use SparkR::: to access private functions.
      
      Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
      
      Closes #5895 from shivaram/rrdd-private and squashes the following commits:
      
      bdb2f07 [Shivaram Venkataraman] Make RDD private in SparkR. This change also makes all internal uses of the SparkR API use SparkR::: to access private functions
      c688e3c5
    • wangfei's avatar
      [SQL][Minor] make StringComparison extends ExpectsInputTypes · 3059291e
      wangfei authored
      make StringComparison extends ExpectsInputTypes and added expectedChildTypes, so do not need override expectedChildTypes in each subclass
      
      Author: wangfei <wangfei1@huawei.com>
      
      Closes #5905 from scwf/ExpectsInputTypes and squashes the following commits:
      
      b374ddf [wangfei] make stringcomparison extends ExpectsInputTypes
      3059291e
    • zsxwing's avatar
      [SPARK-7351] [STREAMING] [DOCS] Add spark.streaming.ui.retainedBatches to docs · fec7b29f
      zsxwing authored
      The default value will be changed to `1000` in #5533. So here I just used `1000`.
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #5899 from zsxwing/SPARK-7351 and squashes the following commits:
      
      e1ec515 [zsxwing] [SPARK-7351][Streaming][Docs] Add spark.streaming.ui.retainedBatches to docs
      fec7b29f
    • 云峤's avatar
      [SPARK-7294][SQL] ADD BETWEEN · 735bc3d0
      云峤 authored
      Author: 云峤 <chensong.cs@alibaba-inc.com>
      Author: kaka1992 <kaka_1992@163.com>
      
      Closes #5839 from kaka1992/master and squashes the following commits:
      
      b15360d [kaka1992] Fix python unit test in sql/test. =_= I forget to commit this file last time.
      f928816 [kaka1992] Fix python style in sql/test.
      d2e7f72 [kaka1992] Fix python style in sql/test.
      c54d904 [kaka1992] Fix empty map bug.
      7e64d1e [云峤] Update
      7b9b858 [云峤] undo
      f080f8d [云峤] update pep8
      76f0c51 [云峤] Merge remote-tracking branch 'remotes/upstream/master'
      7d62368 [云峤] [SPARK-7294] ADD BETWEEN
      baf839b [云峤] [SPARK-7294] ADD BETWEEN
      d11d5b9 [云峤] [SPARK-7294] ADD BETWEEN
      735bc3d0
    • zsxwing's avatar
      [SPARK-6939] [STREAMING] [WEBUI] Add timeline and histogram graphs for streaming statistics · 489700c8
      zsxwing authored
      This is the initial work of SPARK-6939. Not yet ready for code review. Here are the screenshots:
      
      ![graph1](https://cloud.githubusercontent.com/assets/1000778/7165766/465942e0-e3dc-11e4-9b05-c184b09d75dc.png)
      
      ![graph2](https://cloud.githubusercontent.com/assets/1000778/7165779/53f13f34-e3dc-11e4-8714-a4a75b7e09ff.png)
      
      TODOs:
      - [x] Display more information on mouse hover
      - [x] Align the timeline and distribution graphs
      - [x] Clean up the codes
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #5533 from zsxwing/SPARK-6939 and squashes the following commits:
      
      9f7cd19 [zsxwing] Merge branch 'master' into SPARK-6939
      deacc3f [zsxwing] Remove unused import
      cd03424 [zsxwing] Fix .rat-excludes
      70cc87d [zsxwing] Streaming Scheduling Delay => Scheduling Delay
      d457277 [zsxwing] Fix UIUtils in BatchPage
      b3f303e [zsxwing] Add comments for unclear classes and methods
      ff0bff8 [zsxwing] Make InputDStream.name private[streaming]
      cc392c5 [zsxwing] Merge branch 'master' into SPARK-6939
      e275e23 [zsxwing] Move time related methods to Streaming's UIUtils
      d5d86f6 [zsxwing] Fix incorrect lastErrorTime
      3be4b7a [zsxwing] Use InputInfo
      b50fa32 [zsxwing] Jump to the batch page when clicking a point in the timeline graphs
      203605d [zsxwing] Merge branch 'master' into SPARK-6939
      74307cf [zsxwing] Reuse the data for histogram graphs to reduce the page size
      2586916 [zsxwing] Merge branch 'master' into SPARK-6939
      70d8533 [zsxwing] Remove BatchInfo.numRecords and a few renames
      7bbdc0a [zsxwing] Hide the receiver sub table if no receiver
      a2972e9 [zsxwing] Add some ui tests for StreamingPage
      fd03ad0 [zsxwing] Add a test to verify no memory leak
      4a8f886 [zsxwing] Merge branch 'master' into SPARK-6939
      18607a1 [zsxwing] Merge branch 'master' into SPARK-6939
      d0b0aec [zsxwing] Clean up the codes
      a459f49 [zsxwing] Add a dash line to processing time graphs
      8e4363c [zsxwing] Prepare for the demo
      c81a1ee [zsxwing] Change time unit in the graphs automatically
      4c0b43f [zsxwing] Update Streaming UI
      04c7500 [zsxwing] Make the server and client use the same timezone
      fed8219 [zsxwing] Move the x axis at the top and show a better tooltip
      c23ce10 [zsxwing] Make two graphs close
      d78672a [zsxwing] Make the X axis use the same range
      881c907 [zsxwing] Use histogram for distribution
      5688702 [zsxwing] Fix the unit test
      ddf741a [zsxwing] Fix the unit test
      ad93295 [zsxwing] Remove unnecessary codes
      a0458f9 [zsxwing] Clean the codes
      b82ed1e [zsxwing] Update the graphs as per comments
      dd653a1 [zsxwing] Add timeline and histogram graphs for streaming statistics
      489700c8
Loading