  1. Oct 09, 2015
    • [SPARK-8673] [LAUNCHER] API and infrastructure for communicating with child apps. · 015f7ef5
      Marcelo Vanzin authored
      This change adds an API that encapsulates information about an app
      launched using the library. It also creates a socket-based communication
      layer for apps that are launched as child processes; the launching
      application listens for connections from launched apps, and once
      communication is established, the channel can be used to send updates
      to the launching app, or to send commands to the child app.
      
      The change also includes hooks for local, standalone/client and yarn
      masters.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #7052 from vanzin/SPARK-8673.
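      
      For orientation, a minimal sketch of how a launching application might use the new handle-based API (classes from `org.apache.spark.launcher`; the jar path and main class below are placeholders):
      
      ```scala
      import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}
      
      object LaunchChildApp {
        def main(args: Array[String]): Unit = {
          // Start the child app and get back a handle instead of a bare Process.
          val handle: SparkAppHandle = new SparkLauncher()
            .setAppResource("/path/to/child-app.jar")   // placeholder path
            .setMainClass("com.example.ChildApp")        // placeholder class
            .setMaster("local[*]")
            .startApplication()
      
          // The handle receives state updates over the socket-based channel and
          // can also send commands (e.g. stop) back to the child app.
          println(s"Child app state: ${handle.getState}")
          handle.stop()
        }
      }
      ```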
    • [SPARK-10905] [SPARKR] Export freqItems() for DataFrameStatFunctions · 70f44ad2
      Rerngvit Yanggratoke authored
      [SPARK-10905][SparkR]: Export freqItems() for DataFrameStatFunctions
      - Add function (together with roxygen2 doc) to DataFrame.R and generics.R
      - Expose the function in NAMESPACE
      - Add unit test for the function
      
      Author: Rerngvit Yanggratoke <rerngvit@kth.se>
      
      Closes #8962 from rerngvit/SPARK-10905.
    • [SPARK-10875] [MLLIB] Computed covariance matrix should be symmetric · 5994cfe8
      Nick Pritchard authored
      Compute upper triangular values of the covariance matrix, then copy to lower triangular values.
      
      Author: Nick Pritchard <nicholas.pritchard@falkonry.com>
      
      Closes #8940 from pnpritchard/SPARK-10875.
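      
      As an illustration of the approach (not the `RowMatrix` code itself), mirroring the upper triangle of a column-major dense matrix into its lower triangle looks roughly like this:
      
      ```scala
      // Sketch: given an n x n covariance matrix stored column-major (as in mllib's DenseMatrix),
      // copy each upper-triangular value cov(j, i) into the corresponding lower-triangular slot cov(i, j).
      def mirrorUpperToLower(values: Array[Double], n: Int): Unit = {
        var j = 0
        while (j < n) {
          var i = j + 1
          while (i < n) {
            values(i + j * n) = values(j + i * n)  // lower (row i, col j) := upper (row j, col i)
            i += 1
          }
          j += 1
        }
      }
      ```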
    • [SPARK-10959] [PYSPARK] StreamingLogisticRegressionWithSGD does not train with... · 5410747a
      Bryan Cutler authored
      [SPARK-10959] [PYSPARK] StreamingLogisticRegressionWithSGD does not train with given regParam and convergenceTol parameters
      
      These params were being passed into the StreamingLogisticRegressionWithSGD constructor, but not transferred to the call for model training. The same applied to StreamingLinearRegressionWithSGD. I added the params as named arguments to the call and also fixed the intercept parameter, which was being passed as the regularization value.
      
      Author: Bryan Cutler <bjcutler@us.ibm.com>
      
      Closes #9002 from BryanCutler/StreamingSGD-convergenceTol-bug-10959.
  2. Oct 08, 2015
    • [SPARK-10956] Common MemoryManager interface for storage and execution · 67fbecbf
      Andrew Or authored
      This patch introduces a `MemoryManager` that is the central arbiter of how much memory to grant to storage and execution. This patch is primarily a refactoring that preserves the existing behavior as much as possible.
      
      This is the first step away from the existing rigid separation of storage and execution memory, which has several major drawbacks discussed on the [issue](https://issues.apache.org/jira/browse/SPARK-10956). It is the precursor of a series of patches that will attempt to address those drawbacks.
      
      Author: Andrew Or <andrew@databricks.com>
      Author: Josh Rosen <joshrosen@databricks.com>
      Author: andrewor14 <andrew@databricks.com>
      
      Closes #9000 from andrewor14/memory-manager.
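      
      Conceptually (this is only a sketch, not the actual private `MemoryManager` API), the arbiter boils down to something like:
      
      ```scala
      // Sketch of the idea: a single component decides how many bytes storage and
      // execution may each use, instead of two independently sized regions.
      trait MemoryArbiter {
        def maxStorageMemory: Long
        def maxExecutionMemory: Long
      
        /** Try to reserve memory for caching blocks; returns whether the grant succeeded. */
        def acquireStorageMemory(numBytes: Long): Boolean
      
        /** Try to reserve memory for shuffles/joins/sorts; returns the number of bytes actually granted. */
        def acquireExecutionMemory(numBytes: Long): Long
      
        def releaseStorageMemory(numBytes: Long): Unit
        def releaseExecutionMemory(numBytes: Long): Unit
      }
      ```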
    • [SPARK-10955] [STREAMING] Add a warning if dynamic allocation for Streaming applications · 09841290
      Hari Shreedharan authored
      Dynamic allocation can be problematic for streaming apps and can lead to data loss. Log a warning for streaming applications if dynamic allocation is enabled.
      
      Author: Hari Shreedharan <hshreedharan@apache.org>
      
      Closes #8998 from harishreedharan/ss-log-error and squashes the following commits:
      
      462b264 [Hari Shreedharan] Improve log message.
      2733d94 [Hari Shreedharan] Minor change to warning message.
      eaa48cc [Hari Shreedharan] Log a warning instead of failing the application if dynamic allocation is enabled.
      725f090 [Hari Shreedharan] Add config parameter to allow dynamic allocation if the user explicitly sets it.
      b3f9a95 [Hari Shreedharan] Disable dynamic allocation and kill app if it is enabled.
      a4a5212 [Hari Shreedharan] [streaming] SPARK-10955. Disable dynamic allocation for Streaming applications.
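      
      For reference, dynamic allocation is controlled by a standard flag; a streaming job that wants to avoid the new warning can keep it explicitly disabled (a minimal sketch):
      
      ```scala
      import org.apache.spark.SparkConf
      
      // Keep dynamic allocation off for a streaming application to avoid losing
      // executors that still hold received data.
      val conf = new SparkConf()
        .setAppName("my-streaming-app")  // placeholder app name
        .set("spark.dynamicAllocation.enabled", "false")
      ```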
    • [SPARK-11019] [STREAMING] [FLUME] Gracefully shutdown Flume receiver threads. · fa3e4d8f
      Hari Shreedharan authored
      Wait for a minute for the receiver threads to shut down before interrupting them.
      
      Author: Hari Shreedharan <hshreedharan@apache.org>
      
      Closes #9041 from harishreedharan/flume-graceful-shutdown.
    • [SPARK-10973] [ML] [PYTHON] __getitem__ method throws IndexError exception when we… · 8e67882b
      zero323 authored
      The __getitem__ method throws an IndexError exception when we try to access an index after the last non-zero entry:
      
          from pyspark.mllib.linalg import Vectors
          sv = Vectors.sparse(5, {1: 3})
          sv[0]
          ## 0.0
          sv[1]
          ## 3.0
          sv[2]
          ## Traceback (most recent call last):
          ##   File "<stdin>", line 1, in <module>
          ##   File "/python/pyspark/mllib/linalg/__init__.py", line 734, in __getitem__
          ##     row_ind = inds[insert_index]
          ## IndexError: index out of bounds
      
      Author: zero323 <matthew.szymkiewicz@gmail.com>
      
      Closes #9009 from zero323/sparse_vector_index_error.
    • [SPARK-10810] [SPARK-10902] [SQL] Improve session management in SQL · 3390b400
      Davies Liu authored
      This PR improves session management in SQL by replacing the thread-local-based approach with one SQLContext per session, and introduces separate temporary tables and UDFs/UDAFs for each session.
      
      A new SQLContext session can be created by:
      
      1) creating a new SQLContext
      2) calling newSession() on an existing SQLContext
      
      For HiveContext, in order to reduce the cost for each session, the classloader and Hive client are shared across multiple sessions (created by newSession).
      
      CacheManager is also shared by multiple sessions, so caching a table multiple times in different sessions will not create multiple copies of the in-memory cache.
      
      Added jars are still shared by all the sessions, because SparkContext does not support sessions.
      
      cc marmbrus yhuai rxin
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #8909 from davies/sessions.
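      
      A minimal sketch of the new per-session isolation, assuming a spark-shell where `sqlContext` is already defined:
      
      ```scala
      // Each call to newSession() returns a SQLContext with its own temporary tables
      // and UDF registrations, while sharing the same SparkContext and cache.
      val session1 = sqlContext.newSession()
      val session2 = sqlContext.newSession()
      
      session1.range(0, 10).registerTempTable("nums")
      session1.sql("SELECT count(*) FROM nums").show()
      // session2.sql("SELECT count(*) FROM nums")  // would fail: `nums` is scoped to session1
      ```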
    • [SPARK-10914] UnsafeRow serialization breaks when two machines have different Oops size. · 84ea2871
      Reynold Xin authored
      UnsafeRow contains 3 pieces of information when pointing to some data in memory (an object, a base offset, and length). When the row is serialized with Java/Kryo serialization, the object layout in memory can change if two machines have different pointer width (Oops in JVM).
      
      To reproduce, launch Spark using
      
      MASTER=local-cluster[2,1,1024] bin/spark-shell --conf "spark.executor.extraJavaOptions=-XX:-UseCompressedOops"
      
      And then run the following
      
      scala> sql("select 1 xx").collect()
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #9030 from rxin/SPARK-10914.
    • [SPARK-8848] [SQL] Refactors Parquet write path to follow parquet-format · 02149ff0
      Cheng Lian authored
      This PR refactors Parquet write path to follow parquet-format spec.  It's a successor of PR #7679, but with less non-essential changes.
      
      Major changes include:
      
      1.  Replaces `RowWriteSupport` and `MutableRowWriteSupport` with `CatalystWriteSupport`
      
          - Writes Parquet data using standard layout defined in parquet-format
      
            Specifically, we are now writing ...
      
            - ... arrays and maps in standard 3-level structure with proper annotations and field names
            - ... decimals as `INT32` and `INT64` whenever possible, and taking `FIXED_LEN_BYTE_ARRAY` as the final fallback
      
          - Supports legacy mode which is compatible with Spark 1.4 and prior versions
      
            Legacy mode is off by default and can be turned on by setting the SQL option `spark.sql.parquet.writeLegacyFormat` to `true`.
      
          - Eliminates per value data type dispatching costs via prebuilt composed writer functions
      
      1.  Cleans up the last pieces of old Parquet support code
      
      As pointed out by rxin previously, we probably want to rename all those `Catalyst*` Parquet classes to `Parquet*` for clarity. But I'd like to do this in a follow-up PR to minimize code review noise in this one.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #8988 from liancheng/spark-8848/standard-parquet-write-path.
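      
      For example (assuming a spark-shell session where `sqlContext` is defined, and a throwaway output path), the legacy layout can be re-enabled via the option named above:
      
      ```scala
      // Write Parquet data using the Spark 1.4-compatible (legacy) layout.
      sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true")
      sqlContext.range(0, 10).write.parquet("/tmp/legacy-layout-example")
      
      // Switch back to the standard parquet-format layout (the default).
      sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "false")
      ```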
    • [SPARK-10988] [SQL] Reduce duplication in Aggregate2's expression rewriting logic · 2816c89b
      Josh Rosen authored
      In `aggregate/utils.scala`, there is a substantial amount of duplication in the expression-rewriting logic. As a prerequisite to supporting imperative aggregate functions in `TungstenAggregate`, this patch refactors this file so that the same expression-rewriting logic is used for both `SortAggregate` and `TungstenAggregate`.
      
      In order to allow both operators to use the same rewriting logic, `TungstenAggregationIterator.generateResultProjection()` has been updated so that it first evaluates all declarative aggregate functions' `evaluateExpression`s and writes the results into a temporary buffer, and then uses this temporary buffer and the grouping expressions to evaluate the final resultExpressions. This matches the logic in SortAggregateIterator, where this two-pass approach is necessary in order to support imperative aggregates. If this change turns out to cause performance regressions, then we can look into re-implementing the single-pass evaluation in a cleaner way as part of a followup patch.
      
      Since the rewriting logic is now shared across both operators, this patch also extracts that logic and places it in `SparkStrategies`. This makes the rewriting logic a bit easier to follow, I think.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #9015 from JoshRosen/SPARK-10988.
    • [SPARK-10993] [SQL] Initial code generated encoder for product types · 9e66a53c
      Michael Armbrust authored
      This PR is a first cut at code generating an encoder that takes a Scala `Product` type and converts it directly into the tungsten binary format.  This is done through the addition of a new set of expressions that can be used to invoke methods on raw JVM objects, extracting fields and converting the result into the required format.  These can then be used directly in an `UnsafeProjection`, allowing us to leverage the existing encoding logic.
      
      According to some simple benchmarks, this can significantly speed up conversion (~4x).  However, replacing CatalystConverters is deferred to a later PR to keep this PR at a reasonable size.
      
      ```scala
      case class SomeInts(a: Int, b: Int, c: Int, d: Int, e: Int)
      
      val data = SomeInts(1, 2, 3, 4, 5)
      val encoder = ProductEncoder[SomeInts]
      val converter = CatalystTypeConverters.createToCatalystConverter(ScalaReflection.schemaFor[SomeInts].dataType)
      
      (1 to 5).foreach {iter =>
        benchmark(s"converter $iter") {
          var i = 100000000
          while (i > 0) {
            val res = converter(data).asInstanceOf[InternalRow]
            assert(res.getInt(0) == 1)
            assert(res.getInt(1) == 2)
            i -= 1
          }
        }
      
        benchmark(s"encoder $iter") {
          var i = 100000000
          while (i > 0) {
            val res = encoder.toRow(data)
            assert(res.getInt(0) == 1)
            assert(res.getInt(1) == 2)
            i -= 1
          }
        }
      }
      ```
      
      Results:
      ```
      [info] converter 1: 7170ms
      [info] encoder 1: 1888ms
      [info] converter 2: 6763ms
      [info] encoder 2: 1824ms
      [info] converter 3: 6912ms
      [info] encoder 3: 1802ms
      [info] converter 4: 7131ms
      [info] encoder 4: 1798ms
      [info] converter 5: 7350ms
      [info] encoder 5: 1912ms
      ```
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #9019 from marmbrus/productEncoder.
    • Revert [SPARK-8654] [SQL] Fix Analysis exception when using NULL IN · a8226a9f
      Michael Armbrust authored
      This reverts commit dcbd58a9 from #8983
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #9034 from marmbrus/revert8654.
    • [SPARK-10337] [SQL] fix hive views on non-hive-compatible tables. · af2a5544
      Wenchen Fan authored
      add a new config to deal with this special case.
      
      Author: Wenchen Fan <cloud0fan@163.com>
      
      Closes #8990 from cloud-fan/view-master.
    • [SPARK-10887] [SQL] Build HashedRelation outside of HashJoinNode. · 82d275f2
      Yin Huai authored
      This PR refactors `HashJoinNode` to take an existing `HashedRelation`, so we can reuse this node for both `ShuffledHashJoin` and `BroadcastHashJoin`.
      
      https://issues.apache.org/jira/browse/SPARK-10887
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #8953 from yhuai/SPARK-10887.
    • [SPARK-11006] Rename NullColumnAccess as NullColumnAccessor · 2a6f614c
      tedyu authored
      davies
      I think NullColumnAccessor follows the same convention as the other accessors.
      
      Author: tedyu <yuzhihong@gmail.com>
      
      Closes #9028 from tedyu/master.
    • [SPARK-7770] [ML] GBT validationTol change to compare with relative or absolute error · 22683560
      Yanbo Liang authored
      GBT now compares ValidateError with a tolerance that switches between relative and absolute forms, where the former is relative to the current loss on the training set.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #8549 from yanboliang/spark-7770.
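      
      Illustrative sketch only (not the MLlib internals): the tolerance acts relatively when the reference loss is large and absolutely once it is near zero, e.g.:
      
      ```scala
      // Decide whether the improvement between iterations is big enough to keep training.
      // `validationTol` behaves like a relative tolerance for large reference losses and
      // like an absolute tolerance once the loss drops below a small floor (0.01 here).
      def improvedEnough(previousError: Double,
                         currentError: Double,
                         validationTol: Double): Boolean = {
        val threshold = validationTol * math.max(currentError, 0.01)
        (previousError - currentError) > threshold
      }
      ```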
    • [SPARK-9718] [ML] linear regression training summary all columns · 0903c648
      Holden Karau authored
      LinearRegression training summary: The transformed dataset should hold all columns, not just selected ones like prediction and label. There is no real need to remove some, and the user may find them useful.
      
      Author: Holden Karau <holden@pigscanfly.ca>
      
      Closes #8564 from holdenk/SPARK-9718-LinearRegressionTrainingSummary-all-columns.
    • [SPARK-8654] [SQL] Fix Analysis exception when using NULL IN (...) · dcbd58a9
      Dilip Biswal authored
      In the analysis phase, while processing the rules for the IN predicate, we
      compare the in-list types to the LHS expression type and generate a
      cast operation if necessary. In the case of NULL [NOT] IN expr1, we end up
      generating a cast from the in-list types to NULL, like cast(1 as NULL), which
      is not a valid cast.
      
      The fix is to not generate such a cast if the LHS type is a NullType; instead
      we translate the expression to Literal(Null).
      
      Author: Dilip Biswal <dbiswal@us.ibm.com>
      
      Closes #8983 from dilipbiswal/spark_8654.
    • [SPARK-10998] [SQL] Show non-children in default Expression.toString · 5c9fdf74
      Michael Armbrust authored
      It's pretty hard to debug problems with expressions when you can't see all the arguments.
      
      Before: `invoke()`
      After: `invoke(inputObject#1, intField, IntegerType)`
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #9022 from marmbrus/expressionToString.
    • [SPARK-10836] [SPARKR] Added sort(x, decreasing, col, ... ) method to DataFrame · e8f90d9d
      Narine Kokhlikyan authored
      The sort function can be used as an alternative to arrange(...).
      As arguments it accepts x (a DataFrame), decreasing (TRUE/FALSE, or a vector of orderings for the columns), and the list of columns, given as string names.
      
      for example:
      sort(df, TRUE, "col1","col2","col3","col5") # for example, if we want to sort some of the columns in the same order
      
      sort(df, decreasing=TRUE, "col1")
      sort(df, decreasing=c(TRUE,FALSE), "col1","col2")
      
      Author: Narine Kokhlikyan <narine.kokhlikyan@gmail.com>
      
      Closes #8920 from NarineK/sparkrsort.
    • [SPARK-10987] [YARN] Workaround for missing netty rpc disconnection event. · 56a9692f
      Marcelo Vanzin authored
      In YARN client mode, when the AM connects to the driver, it may be the case
      that the driver never needs to send a message back to the AM (i.e., no
      dynamic allocation or preemption). This triggers an issue in the netty rpc
      backend where no disconnection event is sent to endpoints, and the AM never
      exits after the driver shuts down.
      
      The real fix is too complicated, so this is a quick hack to unblock YARN
      client mode until we can work on the real fix. It forces the driver to
      send a message to the AM when the AM registers, thus establishing that
      connection and enabling the disconnection event when the driver goes
      away.
      
      Also, a minor side issue: when the executor is shutting down, it needs
      to send an "ack" back to the driver when using the netty rpc backend; but
      that "ack" wasn't being sent because the handler was shutting down the rpc
      env before returning. So added a change to delay the shutdown a little bit,
      allowing the ack to be sent back.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #9021 from vanzin/SPARK-10987.
    • [SPARK-5775] [SPARK-5508] [SQL] Re-enable Hive Parquet array reading tests · 2df882ef
      Cheng Lian authored
      Since SPARK-5508 has already been fixed.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #8999 from liancheng/spark-5775.enable-array-tests.
    • [SPARK-10999] [SQL] Coalesce should be able to handle UnsafeRow · 59b0606f
      Cheng Lian authored
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #9024 from liancheng/spark-10999.coalesce-unsafe-row-handling.
    • [SPARK-10883] Add a note about how to build Spark sub-modules (reactor) · 60150cf0
      Jean-Baptiste Onofré authored
      Author: Jean-Baptiste Onofré <jbonofre@apache.org>
      
      Closes #8993 from jbonofre/SPARK-10883-2.
    • Akka framesize units should be specified · cd28139c
      admackin authored
      The 1.4 docs noted that the units were MB; I have assumed this is still the case.
      
      Author: admackin <admackin@users.noreply.github.com>
      
      Closes #9025 from admackin/master.
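      
      For reference, the setting in question takes a value in MB, e.g. (a minimal sketch):
      
      ```scala
      import org.apache.spark.SparkConf
      
      // Raise the Akka frame size limit to 128 MB (the value is interpreted as MB, not bytes).
      val conf = new SparkConf().set("spark.akka.frameSize", "128")
      ```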
    • [SPARK-7869][SQL] Adding Postgres JSON and JSONb data types support · b8f849b5
      0x0FFF authored
      This PR addresses [SPARK-7869](https://issues.apache.org/jira/browse/SPARK-7869)
      
      Before the patch, attempting to load a table from Postgres with a JSON/JSONb data type caused the error `java.sql.SQLException: Unsupported type 1111`.
      The Postgres data types JSON and JSONb are now mapped to String on the Spark side, so they can be loaded into a DataFrame and processed by Spark.
      
      Example
      
      Postgres:
      ```
      create table test_json  (id int, value json);
      create table test_jsonb (id int, value jsonb);
      
      insert into test_json (id, value) values
      (1, '{"field1":"value1","field2":"value2","field3":[1,2,3]}'::json),
      (2, '{"field1":"value3","field2":"value4","field3":[4,5,6]}'::json),
      (3, '{"field3":"value5","field4":"value6","field3":[7,8,9]}'::json);
      
      insert into test_jsonb (id, value) values
      (4, '{"field1":"value1","field2":"value2","field3":[1,2,3]}'::jsonb),
      (5, '{"field1":"value3","field2":"value4","field3":[4,5,6]}'::jsonb),
      (6, '{"field3":"value5","field4":"value6","field3":[7,8,9]}'::jsonb);
      ```
      
      PySpark:
      ```
      >>> import json
      >>> df1 = sqlContext.read.jdbc("jdbc:postgresql://127.0.0.1:5432/test?user=testuser", "test_json")
      >>> df1.map(lambda x: (x.id, json.loads(x.value))).map(lambda (id, value): (id, value.get('field3'))).collect()
      [(1, [1, 2, 3]), (2, [4, 5, 6]), (3, [7, 8, 9])]
      >>> df2 = sqlContext.read.jdbc("jdbc:postgresql://127.0.0.1:5432/test?user=testuser", "test_jsonb")
      >>> df2.map(lambda x: (x.id, json.loads(x.value))).map(lambda (id, value): (id, value.get('field1'))).collect()
      [(4, u'value1'), (5, u'value3'), (6, None)]
      ```
      
      Author: 0x0FFF <programmerag@gmail.com>
      
      Closes #8948 from 0x0FFF/SPARK-7869.
  3. Oct 07, 2015
    • [SPARK-9774] [ML] [PYSPARK] Add python api for ml regression isotonicregression · 3aff0866
      Holden Karau authored
      Add the Python API for isotonic regression (IsotonicRegression).
      
      Author: Holden Karau <holden@pigscanfly.ca>
      
      Closes #8214 from holdenk/SPARK-9774-add-python-api-for-ml-regression-isotonicregression.
    • [SPARK-10064] [ML] Parallelize decision tree bin split calculations · 1bc435ae
      Nathan Howell authored
      Reimplement `DecisionTree.findSplitsBins` via `RDD` to parallelize bin calculation.
      
      With large feature spaces the current implementation is very slow. This change limits the features that are distributed (or collected) to just the continuous features, and performs the split calculations in parallel. It completes on a real multi-terabyte dataset in less than a minute instead of multiple hours.
      
      Author: Nathan Howell <nhowell@godaddy.com>
      
      Closes #8246 from NathanHowell/SPARK-10064.
    • [SPARK-10917] [SQL] improve performance of complex type in columnar cache · 075a0b65
      Davies Liu authored
      This PR improves the performance of complex types in the columnar cache by using UnsafeProjection instead of KryoSerializer.
      
      A simple benchmark shows that this PR could improve the performance of scanning a cached table with complex columns by 15x (compared to Spark 1.5).
      
      Here is the code used to benchmark:
      
      ```
      df = sc.range(1<<23).map(lambda i: Row(a=Row(b=i, c=str(i)), d=range(10), e=dict(zip(range(10), [str(i) for i in range(10)])))).toDF()
      df.write.parquet("table")
      ```
      ```
      df = sqlContext.read.parquet("table")
      df.cache()
      df.count()
      t = time.time()
      print df.select("*")._jdf.queryExecution().toRdd().count()
      print time.time() - t
      ```
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #8971 from davies/complex.
    • [SPARK-10738] [ML] Refactoring `Instance` out from LOR and LIR, and also cleaning up some code · dd36ec6b
      DB Tsai authored
      Refactoring `Instance` case class out from LOR and LIR, and also cleaning up some code.
      
      Author: DB Tsai <dbt@netflix.com>
      
      Closes #8853 from dbtsai/refactoring.
    • [SPARK-9702] [SQL] Use Exchange to implement logical Repartition operator · 7e2e2682
      Josh Rosen authored
      This patch allows `Repartition` to support UnsafeRows. This is accomplished by implementing the logical `Repartition` operator in terms of `Exchange` and a new `RoundRobinPartitioning`.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      Author: Liang-Chi Hsieh <viirya@appier.com>
      
      Closes #8083 from JoshRosen/SPARK-9702.
    • [SPARK-10980] [SQL] fix bug in create Decimal · 37526aca
      Davies Liu authored
      The created decimal is wrong when using `Decimal(unscaled, precision, scale)` with unscaled > 1e18, precision > 18, and scale > 0.
      
      This bug has existed since the beginning.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #9014 from davies/fix_decimal.
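      
      A hedged illustration of the affected call shape (the concrete numbers are placeholders chosen to satisfy unscaled > 1e18, precision > 18, and scale > 0):
      
      ```scala
      import org.apache.spark.sql.types.Decimal
      
      // Before the fix, a Decimal built this way could carry the wrong value.
      val d = Decimal(2000000000000000000L, 20, 2)
      println(d)  // expected to print 20000000000000000.00
      ```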
    • [SPARK-10490] [ML] Consolidate the Cholesky solvers in WeightedLeastSquares and ALS · 7bf07faa
      Yanbo Liang authored
      Consolidate the Cholesky solvers in WeightedLeastSquares and ALS.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #8936 from yanboliang/spark-10490.
    • [SPARK-10982] [SQL] Rename ExpressionAggregate -> DeclarativeAggregate. · 6dbfd7ec
      Reynold Xin authored
      DeclarativeAggregate matches more closely with the ImperativeAggregate we already have.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #9013 from rxin/SPARK-10982.
    • [SPARK-10779] [PYSPARK] [MLLIB] Set initialModel for KMeans model in PySpark (spark.mllib) · da936fbb
      Evan Chen authored
      Provide initialModel param for pyspark.mllib.clustering.KMeans
      
      Author: Evan Chen <chene@us.ibm.com>
      
      Closes #8967 from evanyc15/SPARK-10779-pyspark-mllib.
    • [SPARK-10679] [CORE] javax.jdo.JDOFatalUserException in executor · 713e4f44
      navis.ryu authored
      HadoopRDD throws an exception in the executor, something like the one below.
      ```
      15/09/17 18:51:21 INFO metastore.HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
      15/09/17 18:51:21 INFO metastore.ObjectStore: ObjectStore, initialize called
      15/09/17 18:51:21 WARN metastore.HiveMetaStore: Retrying creating default database after error: Class org.datanucleus.api.jdo.JDOPersistenceManagerFactory was not found.
      javax.jdo.JDOFatalUserException: Class org.datanucleus.api.jdo.JDOPersistenceManagerFactory was not found.
      	at javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1175)
      	at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:808)
      	at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:701)
      	at org.apache.hadoop.hive.metastore.ObjectStore.getPMF(ObjectStore.java:365)
      	at org.apache.hadoop.hive.metastore.ObjectStore.getPersistenceManager(ObjectStore.java:394)
      	at org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:291)
      	at org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:258)
      	at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73)
      	at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
      	at org.apache.hadoop.hive.metastore.RawStoreProxy.<init>(RawStoreProxy.java:57)
      	at org.apache.hadoop.hive.metastore.RawStoreProxy.getProxy(RawStoreProxy.java:66)
      	at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newRawStore(HiveMetaStore.java:593)
      	at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:571)
      	at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:620)
      	at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:461)
      	at org.apache.hadoop.hive.metastore.RetryingHMSHandler.<init>(RetryingHMSHandler.java:66)
      	at org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:72)
      	at org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:5762)
      	at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:199)
      	at org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.<init>(SessionHiveMetaStoreClient.java:74)
      	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
      	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
      	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
      	at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
      	at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1521)
      	at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:86)
      	at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:132)
      	at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:104)
      	at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3005)
      	at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3024)
      	at org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1234)
      	at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:174)
      	at org.apache.hadoop.hive.ql.metadata.Hive.<clinit>(Hive.java:166)
      	at org.apache.hadoop.hive.ql.plan.PlanUtils.configureJobPropertiesForStorageHandler(PlanUtils.java:803)
      	at org.apache.hadoop.hive.ql.plan.PlanUtils.configureInputJobPropertiesForStorageHandler(PlanUtils.java:782)
      	at org.apache.spark.sql.hive.HadoopTableReader$.initializeLocalJobConfFunc(TableReader.scala:298)
      	at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$12.apply(TableReader.scala:274)
      	at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$12.apply(TableReader.scala:274)
      	at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176)
      	at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176)
      	at scala.Option.map(Option.scala:145)
      	at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:176)
      	at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:220)
      	at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:216)
      	at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
      	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
      	at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
      	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
      	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
      	at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
      	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
      	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
      	at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
      	at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
      	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
      	at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
      	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
      	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
      	at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
      	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
      	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
      	at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
      	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
      	at org.apache.spark.scheduler.Task.run(Task.scala:88)
      	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
      	at java.lang.Thread.run(Thread.java:745)
      ```
      
      Author: navis.ryu <navis@apache.org>
      
      Closes #8804 from navis/SPARK-10679.
    • [SPARK-10856][SQL] Mapping TimestampType to DATETIME for SQL Server jdbc dialect · c14aee4d
      Liang-Chi Hsieh authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-10856
      
      For Microsoft SQL Server, TimestampType should be mapped to DATETIME instead of TIMESTAMP.
      
      Related information for the datatype mapping: https://msdn.microsoft.com/en-us/library/ms378878(v=sql.110).aspx
      
      Author: Liang-Chi Hsieh <viirya@appier.com>
      
      Closes #8978 from viirya/mysql-jdbc-timestamp.
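      
      For context, a JDBC dialect override of this kind generally looks like the following sketch (illustrative only; the actual SQL Server dialect in the patch may differ in detail). It is written to be pasted into a spark-shell:
      
      ```scala
      import java.sql.Types
      
      import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
      import org.apache.spark.sql.types.{DataType, TimestampType}
      
      // Map Spark's TimestampType to SQL Server's DATETIME when generating table definitions.
      case object SqlServerDialectSketch extends JdbcDialect {
        override def canHandle(url: String): Boolean = url.startsWith("jdbc:sqlserver")
      
        override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
          case TimestampType => Some(JdbcType("DATETIME", Types.TIMESTAMP))
          case _ => None
        }
      }
      
      // Custom dialects are picked up via registration.
      JdbcDialects.registerDialect(SqlServerDialectSketch)
      ```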
    • [SPARK-10300] [BUILD] [TESTS] Add support for test tags in run-tests.py. · 94fc57af
      Marcelo Vanzin authored
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #8775 from vanzin/SPARK-10300.