  1. Aug 07, 2015
    • Reynold Xin's avatar
      [SPARK-9736] [SQL] JoinedRow.anyNull should delegate to the underlying rows. · 9897cc5e
      Reynold Xin authored
      JoinedRow.anyNull currently loops through every field to check for null, which is inefficient if the underlying rows are UnsafeRows. It should just delegate to the underlying implementation.
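
The delegation idea can be sketched in a few lines of Python. This is only an illustration: the names `Row`, `JoinedRow` and `any_null` are hypothetical stand-ins, not Spark's actual Scala classes.

```python
class Row:
    """Hypothetical row type that can answer any_null() on its own."""

    def __init__(self, values):
        self.values = list(values)

    def any_null(self):
        # A specialised row (for example one backed by a null bitmap) could
        # answer this without visiting every field; a plain scan is used here.
        return any(v is None for v in self.values)


class JoinedRow:
    """Hypothetical view over two rows placed side by side."""

    def __init__(self, left, right):
        self.left = left
        self.right = right

    def any_null(self):
        # Delegate to each underlying row instead of looping over all fields.
        return self.left.any_null() or self.right.any_null()
```

With this shape, each underlying row is free to use whatever fast null check its own layout allows.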
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #8027 from rxin/SPARK-9736 and squashes the following commits:
      
      03a2e92 [Reynold Xin] Include all files.
      90f1add [Reynold Xin] [SPARK-9736][SQL] JoinedRow.anyNull should delegate to the underlying rows.
      9897cc5e
    • Wenchen Fan's avatar
      [SPARK-8382] [SQL] Improve Analysis Unit test framework · 2432c2e2
      Wenchen Fan authored
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      Closes #8025 from cloud-fan/analysis and squashes the following commits:
      
      51461b1 [Wenchen Fan] move test file to test folder
      ec88ace [Wenchen Fan] Improve Analysis Unit test framework
      2432c2e2
    • Reynold Xin's avatar
      [SPARK-9674][SPARK-9667] Remove SparkSqlSerializer2 · 76eaa701
      Reynold Xin authored
      It is now subsumed by various Tungsten operators.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #7981 from rxin/SPARK-9674 and squashes the following commits:
      
      144f96e [Reynold Xin] Re-enable test
      58b7332 [Reynold Xin] Disable failing list.
      fb797e3 [Reynold Xin] Match all UDTs.
      be9f243 [Reynold Xin] Updated if.
      71fc99c [Reynold Xin] [SPARK-9674][SPARK-9667] Remove GeneratedAggregate & SparkSqlSerializer2.
      76eaa701
    • zsxwing's avatar
      [SPARK-9467][SQL]Add SQLMetric to specialize accumulators to avoid boxing · ebfd91c5
      zsxwing authored
      This PR adds SQLMetric/SQLMetricParam/SQLMetricValue to specialize accumulators to avoid boxing. All SQL metrics should use these classes rather than `Accumulator`.
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #7996 from zsxwing/sql-accu and squashes the following commits:
      
      14a5f0a [zsxwing] Address comments
      367ca23 [zsxwing] Use localValue directly to avoid changing Accumulable
      42f50c3 [zsxwing] Add SQLMetric to specialize accumulators to avoid boxing
      ebfd91c5
    • Wenchen Fan's avatar
      [SPARK-9683] [SQL] copy UTF8String when convert unsafe array/map to safe · e57d6b56
      Wenchen Fan authored
      When we convert an unsafe row to a safe row, we copy the column if it is a struct or string type. However, strings inside an unsafe array/map are not copied, which may cause problems.
      
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      Closes #7990 from cloud-fan/copy and squashes the following commits:
      
      c13d1e3 [Wenchen Fan] change test name
      fe36294 [Wenchen Fan] we should deep copy UTF8String when convert unsafe row to safe row
      e57d6b56
    • Davies Liu's avatar
      [SPARK-9453] [SQL] support records larger than page size in UnsafeShuffleExternalSorter · 15bd6f33
      Davies Liu authored
      This patch follows exactly #7891 (except testing)
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #8005 from davies/larger_record and squashes the following commits:
      
      f9c4aff [Davies Liu] address comments
      9de5c72 [Davies Liu] support records larger than page size in UnsafeShuffleExternalSorter
      15bd6f33
    • Reynold Xin's avatar
      [SPARK-9700] Pick default page size more intelligently. · 4309262e
      Reynold Xin authored
      Previously, we used 64MB as the default page size, which was far too big for many Spark applications (especially on a single node).
      
      This patch changes it so that the default page size, if unset by the user, is determined by the number of cores available and the total execution memory available.
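
The commit message does not give the formula, so the following Python sketch is an assumption about what "determined by cores and execution memory" could look like. The function name, the safety factor of 16, the rounding to a power of two and the 1 MB lower bound are all illustrative choices; only the 64 MB upper bound comes from the text above.

```python
def next_power_of_two(n):
    """Smallest power of two that is >= n (for n >= 1)."""
    return 1 << (n - 1).bit_length()


def default_page_size(execution_memory_bytes, cores,
                      safety_factor=16,      # assumed headroom factor
                      min_page=1 << 20,      # 1 MB, assumed lower bound
                      max_page=64 << 20):    # 64 MB, the previous default
    # Share the execution memory across cores, leave headroom via the
    # safety factor, then round up to a power of two and clamp.
    per_core = execution_memory_bytes // cores // safety_factor
    size = next_power_of_two(max(per_core, 1))
    return min(max_page, max(min_page, size))
```

Under these assumptions, 512 MB of execution memory on 8 cores gives a 4 MB page rather than 64 MB.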
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #8012 from rxin/pagesize and squashes the following commits:
      
      16f4756 [Reynold Xin] Fixed failing test.
      5afd570 [Reynold Xin] private...
      0d5fb98 [Reynold Xin] Update default value.
      674a6cd [Reynold Xin] Address review feedback.
      dc00e05 [Reynold Xin] Merge with master.
      73ebdb6 [Reynold Xin] [SPARK-9700] Pick default page size more intelligently.
      4309262e
    • zsxwing's avatar
      [SPARK-8862][SQL]Support multiple SQLContexts in Web UI · 7aaed1b1
      zsxwing authored
      This is a follow-up PR to solve the UI issue when there are multiple SQLContexts. Each SQLContext has a separate tab and contains queries which are executed by this SQLContext.
      
      <img width="1366" alt="multiple sqlcontexts" src="https://cloud.githubusercontent.com/assets/1000778/9088391/54584434-3bc2-11e5-9caf-94c2b0da528e.png">
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #7962 from zsxwing/multi-sqlcontext-ui and squashes the following commits:
      
      cf661e1 [zsxwing] sql -> SQL
      39b0c97 [zsxwing] Support multiple SQLContexts in Web UI
      7aaed1b1
    • Cheng Lian's avatar
      [SPARK-7550] [SQL] [MINOR] Fixes logs when persisting DataFrames · f0cda587
      Cheng Lian authored
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #8021 from liancheng/spark-7550/fix-logs and squashes the following commits:
      
      b7bd0ed [Cheng Lian] Fixes logs
      f0cda587
  2. Aug 06, 2015
    • zsxwing's avatar
      [SPARK-8057][Core]Call TaskAttemptContext.getTaskAttemptID using Reflection · 672f4676
      zsxwing authored
      Someone may use the Spark core jar in the maven repo with hadoop 1. SPARK-2075 has already resolved the compatibility issue to support it. But `SparkHadoopMapRedUtil.commitTask` broke it recently.
      
      This PR uses Reflection to call `TaskAttemptContext.getTaskAttemptID` to fix the compatibility issue.
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #6599 from zsxwing/SPARK-8057 and squashes the following commits:
      
      f7a343c [zsxwing] Remove the redundant import
      6b7f1af [zsxwing] Call TaskAttemptContext.getTaskAttemptID using Reflection
      672f4676
    • Jeff Zhang's avatar
      Fix doc typo · fe12277b
      Jeff Zhang authored
      Straightforward fix on doc typo
      
      Author: Jeff Zhang <zjffdu@apache.org>
      
      Closes #8019 from zjffdu/master and squashes the following commits:
      
      aed6e64 [Jeff Zhang] Fix doc typo
      fe12277b
    • Davies Liu's avatar
      [SPARK-9228] [SQL] use tungsten.enabled in public for both of codegen/unsafe · 17284db3
      Davies Liu authored
      
      spark.sql.tungsten.enabled will be the default value for both codegen and unsafe; the two individual flags are kept internally for debugging/testing.
      
      cc marmbrus rxin
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #7998 from davies/tungsten and squashes the following commits:
      
      c1c16da [Davies Liu] update doc
      1a47be1 [Davies Liu] use tungsten.enabled for both of codegen/unsafe
      
      (cherry picked from commit 4e70e825)
      Signed-off-by: Reynold Xin <rxin@databricks.com>
      17284db3
    • Andrew Or's avatar
      [SPARK-9709] [SQL] Avoid starving unsafe operators that use sort · 014a9f9d
      Andrew Or authored
      The issue is that a task may run multiple sorts, and the sorts run by the child operator (i.e. parent RDD) may acquire all available memory such that other sorts in the same task do not have enough to proceed. This manifests itself in an `IOException("Unable to acquire X bytes of memory")` thrown by `UnsafeExternalSorter`.
      
      The solution is to reserve a page in each sorter in the chain before computing the child operator's (parent RDD's) partitions. This requires us to use a new special RDD that does some preparation before computing the parent's partitions.
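
A rough Python sketch of the ordering described above. Every name here (`MemoryPool`, `Sorter`, `reserve_page`, `map_partition_with_preparation`) is hypothetical, and memory is simplified to a counter of pages; the point is only that the preparation step runs before the parent's partition is computed.

```python
class MemoryPool:
    """Hypothetical pool that hands out a fixed number of pages."""

    def __init__(self, pages):
        self.free = pages

    def acquire(self):
        if self.free == 0:
            raise MemoryError("Unable to acquire a page")
        self.free -= 1


class Sorter:
    """Hypothetical sorter that needs at least one page to make progress."""

    def __init__(self, pool):
        self.pool = pool
        self.pages = 0

    def reserve_page(self):
        self.pool.acquire()
        self.pages += 1

    def sort(self, records):
        if self.pages == 0:
            self.reserve_page()   # fails if the pool is already drained
        return sorted(records)


def map_partition_with_preparation(prepare, compute_parent, execute):
    # Run the preparation step first, and only then compute the parent's
    # partition, so the downstream operator already holds its page.
    prepared = prepare()
    return execute(prepared, compute_parent())
```

If `prepare` reserves a page for the downstream sorter, a greedy child sorter can take everything that is left without starving it.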
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #8011 from andrewor14/unsafe-starve-memory and squashes the following commits:
      
      35b69a4 [Andrew Or] Simplify test
      0b07782 [Andrew Or] Minor: update comments
      5d5afdf [Andrew Or] Merge branch 'master' of github.com:apache/spark into unsafe-starve-memory
      254032e [Andrew Or] Add tests
      234acbd [Andrew Or] Reserve a page in sorter when preparing each partition
      b889e08 [Andrew Or] MapPartitionsWithPreparationRDD
      014a9f9d
    • Reynold Xin's avatar
      [SPARK-9692] Remove SqlNewHadoopRDD's generated Tuple2 and InterruptibleIterator. · b8782531
      Reynold Xin authored
      A small performance optimization – we don't need to generate a Tuple2 and then immediately discard the key. We also don't need an extra wrapper from InterruptibleIterator.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #8000 from rxin/SPARK-9692 and squashes the following commits:
      
      1d4d0b3 [Reynold Xin] [SPARK-9692] Remove SqlNewHadoopRDD's generated Tuple2 and InterruptibleIterator.
      b8782531
    • Michael Armbrust's avatar
      [SPARK-9650][SQL] Fix quoting behavior on interpolated column names · 0867b23c
      Michael Armbrust authored
      Make sure that `$"column"` is consistent with other methods with respect to backticks.  Adds a bunch of tests for various ways of constructing columns.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #7969 from marmbrus/namesWithDots and squashes the following commits:
      
      53ef3d7 [Michael Armbrust] [SPARK-9650][SQL] Fix quoting behavior on interpolated column names
      2bf7a92 [Michael Armbrust] WIP
      0867b23c
    • Davies Liu's avatar
      [SPARK-9228] [SQL] use tungsten.enabled in public for both of codegen/unsafe · 4e70e825
      Davies Liu authored
      spark.sql.tungsten.enabled will be the default value for both codegen and unsafe; the two individual flags are kept internally for debugging/testing.
      
      cc marmbrus rxin
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #7998 from davies/tungsten and squashes the following commits:
      
      c1c16da [Davies Liu] update doc
      1a47be1 [Davies Liu] use tungsten.enabled for both of codegen/unsafe
      4e70e825
    • Yin Huai's avatar
      [SPARK-9691] [SQL] PySpark SQL rand function treats seed 0 as no seed · baf4587a
      Yin Huai authored
      https://issues.apache.org/jira/browse/SPARK-9691
      
      jkbradley rxin
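
The squashed commit for this change replaces `if seed` with `if seed is not None`. A small Python sketch shows why the two differ; `rand_buggy` and `rand_fixed` are hypothetical helpers that just return a tuple describing the call, not PySpark's actual `rand`.

```python
def rand_buggy(seed=None):
    # `if seed` is False for seed == 0, so an explicit 0 is silently dropped.
    return ("rand", seed) if seed else ("rand",)


def rand_fixed(seed=None):
    # Only a missing seed (None) should fall back to the unseeded form.
    return ("rand", seed) if seed is not None else ("rand",)
```

Here `rand_buggy(0)` behaves like a call with no seed, while `rand_fixed(0)` keeps the seed.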
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #7999 from yhuai/pythonRand and squashes the following commits:
      
      4187e0c [Yin Huai] Regression test.
      a985ef9 [Yin Huai] Use "if seed is not None" instead "if seed" because "if seed" returns false when seed is 0.
      baf4587a
    • Sean Owen's avatar
      [SPARK-9633] [BUILD] SBT download locations outdated; need an update · 681e3024
      Sean Owen authored
      Remove 2 defunct SBT download URLs and replace with the 1 known download URL. Also, use https.
      Follow up on https://github.com/apache/spark/pull/7792
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #7956 from srowen/SPARK-9633 and squashes the following commits:
      
      caa40bd [Sean Owen] Remove 2 defunct SBT download URLs and replace with the 1 known download URL. Also, use https.
      681e3024
    • Marcelo Vanzin's avatar
      [SPARK-9645] [YARN] [CORE] Allow shuffle service to read shuffle files. · e234ea1b
      Marcelo Vanzin authored
      Spark should not mess with the permissions of directories created
      by the cluster manager. Here, by setting the block manager dir
      permissions to 700, the shuffle service (running as the YARN user)
      wouldn't be able to serve shuffle files created by applications.
      
      Also, the code to protect the local app dir was missing in standalone's
      Worker; that has been now added. Since all processes run as the same
      user in standalone, `chmod 700` should not cause problems.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #7966 from vanzin/SPARK-9645 and squashes the following commits:
      
      6e07b31 [Marcelo Vanzin] Protect the app dir in standalone mode.
      384ba6a [Marcelo Vanzin] [SPARK-9645] [yarn] [core] Allow shuffle service to read shuffle files.
      e234ea1b
    • Yin Huai's avatar
      [SPARK-9630] [SQL] Clean up new aggregate operators (SPARK-9240 follow up) · 3504bf3a
      Yin Huai authored
      This is a follow-up to https://github.com/apache/spark/pull/7813. It renames `HybridUnsafeAggregationIterator` to `TungstenAggregationIterator` and makes it work only with `UnsafeRow`. I also add a `TungstenAggregate` that uses `TungstenAggregationIterator` and make `SortBasedAggregate` (renamed from `SortBasedAggregate`) work only with `SafeRow`.
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #7954 from yhuai/agg-followUp and squashes the following commits:
      
      4d2f4fc [Yin Huai] Add comments and free map.
      0d7ddb9 [Yin Huai] Add TungstenAggregationQueryWithControlledFallbackSuite to test fall back process.
      91d69c2 [Yin Huai] Rename UnsafeHybridAggregationIterator to TungstenAggregationIterator and make it only work with UnsafeRow.
      3504bf3a
    • zsxwing's avatar
      [SPARK-9639] [STREAMING] Fix a potential NPE in Streaming JobScheduler · 34620909
      zsxwing authored
      Because `JobScheduler.stop(false)` may set `eventLoop` to null while `JobHandler` is running, `eventLoop` may happen to be null when `post` is called.
      
      This PR fixed this bug and also set threads in `jobExecutor` to `daemon`.
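
The commit message does not say how the bug was fixed. One common pattern for this kind of race, sketched in Python with a hypothetical `JobScheduler` class, is to read the field once into a local variable and check that local copy.

```python
class JobScheduler:
    """Hypothetical scheduler whose event loop can be cleared by stop()."""

    def __init__(self):
        self._event_loop = []          # stands in for a real event loop

    def stop(self):
        self._event_loop = None

    def post(self, event):
        # Read the field once; checking the field and then reading it again
        # would leave a window in which a concurrent stop() can clear it.
        loop = self._event_loop
        if loop is None:
            return False               # already stopped: drop the event
        loop.append(event)
        return True
```

After `stop()`, `post` returns `False` instead of failing on a null reference.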
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #7960 from zsxwing/fix-npe and squashes the following commits:
      
      b0864c4 [zsxwing] Fix a potential NPE in Streaming JobScheduler
      34620909
    • cody koeninger's avatar
      [DOCS] [STREAMING] make the existing parameter docs for OffsetRange actually visible · 1723e348
      cody koeninger authored
      
      Author: cody koeninger <cody@koeninger.org>
      
      Closes #7995 from koeninger/doc-fixes and squashes the following commits:
      
      87af9ea [cody koeninger] [Docs][Streaming] make the existing parameter docs for OffsetRange actually visible
      1723e348
    • Tathagata Das's avatar
      [SPARK-9556] [SPARK-9619] [SPARK-9624] [STREAMING] Make BlockGenerator more... · 0a078303
      Tathagata Das authored
      [SPARK-9556] [SPARK-9619] [SPARK-9624] [STREAMING] Make BlockGenerator more robust and make all BlockGenerators subscribe to rate limit updates
      
      In some receivers, instead of using the default `BlockGenerator` in `ReceiverSupervisorImpl`, custom generators with their own listeners are used for reliability (see [`ReliableKafkaReceiver`](https://github.com/apache/spark/blob/master/external/kafka/src/main/scala/org/apache/spark/streaming/kafka/ReliableKafkaReceiver.scala#L99) and [updated `KinesisReceiver`](https://github.com/apache/spark/pull/7825/files)). These custom generators do not receive rate updates. This PR modifies the code to allow custom `BlockGenerator`s to be created through the `ReceiverSupervisorImpl` so that they can be tracked and rate updates can be applied to them.
      
      In the process, I simplified and de-flaked some rate-controller-related tests. In particular:
      - Renamed `Receiver.executor` to `Receiver.supervisor` (to match `ReceiverSupervisor`)
      - Made `RateControllerSuite` faster (by increasing batch interval) and less flaky
      - Changed a few internal APIs to return the current rate of block generators as `Long` instead of `Option[Long]` (was inconsistent in places).
      - Updated existing `ReceiverTrackerSuite` to test that custom block generators get rate updates as well.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #7913 from tdas/SPARK-9556 and squashes the following commits:
      
      41d4461 [Tathagata Das] fix scala style
      eb9fd59 [Tathagata Das] Updated kinesis receiver
      d24994d [Tathagata Das] Updated BlockGeneratorSuite to use manual clock in BlockGenerator
      d70608b [Tathagata Das] Updated BlockGenerator with states and proper synchronization
      f6bd47e [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into SPARK-9556
      31da173 [Tathagata Das] Fix bug
      12116df [Tathagata Das] Add BlockGeneratorSuite
      74bd069 [Tathagata Das] Fix style
      989bb5c [Tathagata Das] Made BlockGenerator fail if used after stop, and added better unit tests for it
      3ff618c [Tathagata Das] Fix test
      b40eff8 [Tathagata Das] slight refactoring
      f0df0f1 [Tathagata Das] Scala style fixes
      51759cb [Tathagata Das] Refactored rate controller tests and added the ability to update rate of any custom block generator
      0a078303
    • Liang-Chi Hsieh's avatar
      [SPARK-9548][SQL] Add a destructive iterator for BytesToBytesMap · 21fdfd7d
      Liang-Chi Hsieh authored
      This pull request adds a destructive iterator to BytesToBytesMap. When used, the iterator frees pages as it traverses them. This is part of the effort to avoid starvation when more than one operator can exhaust memory.
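
As a loose illustration of "frees pages as it traverses them", here is a Python generator over a list of pages. The function name, the list-of-lists layout and the `free_page` callback are assumptions made for the sketch, not the BytesToBytesMap API.

```python
def destructive_iterator(pages, free_page):
    """Yield every record, releasing each page once it has been consumed.

    `pages` is a list of lists of records; `free_page` is a callback that
    returns a page to the (hypothetical) memory manager.
    """
    while pages:
        page = pages.pop(0)        # the container no longer references it
        for record in page:
            yield record
        free_page(page)            # released before moving to the next page
```

Because pages are released during the traversal, the container is empty afterwards and cannot be iterated again.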
      
      This is based on #7924, but fixes a bug there (Don't use destructive iterator in UnsafeKVExternalSorter).
      
      Closes #7924.
      
      Author: Liang-Chi Hsieh <viirya@appier.com>
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #8003 from rxin/map-destructive-iterator and squashes the following commits:
      
      6b618c3 [Reynold Xin] Don't use destructive iterator in UnsafeKVExternalSorter.
      a7bd8ec [Reynold Xin] Merge remote-tracking branch 'viirya/destructive_iter' into map-destructive-iterator
      7652083 [Liang-Chi Hsieh] For comments: add destructiveIterator(), modify unit test, remove code block.
      4a3e9de [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into destructive_iter
      581e9e3 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into destructive_iter
      f0ff783 [Liang-Chi Hsieh] No need to free last page.
      9e9d2a3 [Liang-Chi Hsieh] Add a destructive iterator for BytesToBytesMap.
      21fdfd7d
    • Christian Kadner's avatar
      [SPARK-9211] [SQL] [TEST] normalize line separators before generating MD5 hash · abfedb9c
      Christian Kadner authored
      The golden answer file names for the existing Hive comparison tests were generated using an MD5 hash of the query text, which uses Unix-style line separator characters `\n` (LF).
      This PR ensures that all occurrences of the Windows-style line separator `\r\n` (CRLF) are replaced with `\n` (LF) before generating the MD5 hash, so that golden answer file names generated on Windows get an identical MD5 hash.
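
The normalization step can be sketched in Python as follows. The function name `golden_file_hash` and the UTF-8 encoding are assumptions for illustration; the actual test harness is written in Scala.

```python
import hashlib


def golden_file_hash(query_text):
    # Normalise Windows line endings (CRLF) to Unix (LF) before hashing so
    # the same query yields the same file name on every platform.
    normalized = query_text.replace("\r\n", "\n")
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()
```

A query checked out with CRLF endings then hashes to the same value as its LF counterpart.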
      
      Author: Christian Kadner <ckadner@us.ibm.com>
      
      Closes #7563 from ckadner/SPARK-9211_working and squashes the following commits:
      
      d541db0 [Christian Kadner] [SPARK-9211][SQL] normalize line separators before MD5 hash
      abfedb9c
    • Xiangrui Meng's avatar
      [SPARK-9493] [ML] add featureIndex to handle vector features in IsotonicRegression · 54c0789a
      Xiangrui Meng authored
      This PR contains the following changes:
      * add `featureIndex` to handle vector features (in order to chain isotonic regression easily with output from logistic regression)
      * make getter/setter names consistent with params
      * remove inheritance from Regressor because it is tricky to handle both `DoubleType` and `VectorType`
      * simplify test data generation
      
      jkbradley zapletal-martin
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #7952 from mengxr/SPARK-9493 and squashes the following commits:
      
      8818ac3 [Xiangrui Meng] address comments
      05e2216 [Xiangrui Meng] address comments
      8d08090 [Xiangrui Meng] add featureIndex to handle vector features make getter/setter names consistent with params remove inheritance from Regressor
      54c0789a
    • Wenchen Fan's avatar
      [SPARK-9632][SQL] update InternalRow.toSeq to make it accept data type info · 1f62f104
      Wenchen Fan authored
      This re-applies #7955, which was reverted to fix a build break caused by a race condition.
      
      Author: Wenchen Fan <cloud0fan@outlook.com>
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #8002 from rxin/InternalRow-toSeq and squashes the following commits:
      
      332416a [Reynold Xin] Merge pull request #7955 from cloud-fan/toSeq
      21665e2 [Wenchen Fan] fix hive again...
      4addf29 [Wenchen Fan] fix hive
      bc16c59 [Wenchen Fan] minor fix
      33d802c [Wenchen Fan] pass data type info to InternalRow.toSeq
      3dd033e [Wenchen Fan] move the default special getters implementation from InternalRow to BaseGenericInternalRow
      1f62f104
    • Nilanjan Raychaudhuri's avatar
      [SPARK-8978] [STREAMING] Implements the DirectKafkaRateController · a1bbf1bc
      Nilanjan Raychaudhuri authored
      Author: Dean Wampler <dean@concurrentthought.com>
      Author: Nilanjan Raychaudhuri <nraychaudhuri@gmail.com>
      Author: François Garillot <francois@garillot.net>
      
      Closes #7796 from dragos/topic/streaming-bp/kafka-direct and squashes the following commits:
      
      50d1f21 [Nilanjan Raychaudhuri] Taking care of the remaining nits
      648c8b1 [Dean Wampler] Refactored rate controller test to be more predictable and run faster.
      e43f678 [Nilanjan Raychaudhuri] fixing doc and nits
      ce19d2a [Dean Wampler] Removing an unreliable assertion.
      9615320 [Dean Wampler] Give me a break...
      6372478 [Dean Wampler] Found a few ways to make this test more robust...
      9e69e37 [Dean Wampler] Attempt to fix flakey test that fails in CI, but not locally :(
      d3db1ea [Dean Wampler] Fixing stylecheck errors.
      d04a288 [Nilanjan Raychaudhuri] adding test to make sure rate controller is used to calculate maxMessagesPerPartition
      b6ecb67 [Nilanjan Raychaudhuri] Fixed styling issue
      3110267 [Nilanjan Raychaudhuri] [SPARK-8978][Streaming] Implements the DirectKafkaRateController
      393c580 [François Garillot] [SPARK-8978][Streaming] Implements the DirectKafkaRateController
      51e78c6 [Nilanjan Raychaudhuri] Rename and fix build failure
      2795509 [Nilanjan Raychaudhuri] Added missing RateController
      19200f5 [Dean Wampler] Removed usage of infix notation. Changed a private variable name to be more consistent with usage.
      aa4a70b [François Garillot] [SPARK-8978][Streaming] Implements the DirectKafkaController
      a1bbf1bc
    • Sean Owen's avatar
      [SPARK-9641] [DOCS] spark.shuffle.service.port is not documented · 0d7aac99
      Sean Owen authored
      Document spark.shuffle.service.{enabled,port}
      
      CC sryza tgravescs
      This is pretty minimal; is there more to say here about the service?
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #7991 from srowen/SPARK-9641 and squashes the following commits:
      
      3bb946e [Sean Owen] Add link to docs for setup and config of external shuffle service
      2302e01 [Sean Owen] Document spark.shuffle.service.{enabled,port}
      0d7aac99
    • Yin Huai's avatar
      [SPARK-9632] [SQL] [HOT-FIX] Fix build. · cdd53b76
      Yin Huai authored
      It seems that https://github.com/apache/spark/pull/7955 breaks the build.
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #8001 from yhuai/SPARK-9632-fixBuild and squashes the following commits:
      
      6c257dd [Yin Huai] Fix build.
      cdd53b76
    • Wenchen Fan's avatar
      [SPARK-9632][SQL] update InternalRow.toSeq to make it accept data type info · 6e009cb9
      Wenchen Fan authored
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      Closes #7955 from cloud-fan/toSeq and squashes the following commits:
      
      21665e2 [Wenchen Fan] fix hive again...
      4addf29 [Wenchen Fan] fix hive
      bc16c59 [Wenchen Fan] minor fix
      33d802c [Wenchen Fan] pass data type info to InternalRow.toSeq
      3dd033e [Wenchen Fan] move the default special getters implementation from InternalRow to BaseGenericInternalRow
      6e009cb9
    • Reynold Xin's avatar
      [SPARK-9659][SQL] Rename inSet to isin to match Pandas function. · 5e1b0ef0
      Reynold Xin authored
      Inspiration drawn from this blog post: https://lab.getbase.com/pandarize-spark-dataframes/
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #7977 from rxin/isin and squashes the following commits:
      
      9b1d3d6 [Reynold Xin] Added return.
      2197d37 [Reynold Xin] Fixed test case.
      7c1b6cf [Reynold Xin] Import warnings.
      4f4a35d [Reynold Xin] [SPARK-9659][SQL] Rename inSet to isin to match Pandas function.
      5e1b0ef0
    • Burak Yavuz's avatar
      [SPARK-9615] [SPARK-9616] [SQL] [MLLIB] Bugs related to FrequentItems when... · 98e69467
      Burak Yavuz authored
      [SPARK-9615] [SPARK-9616] [SQL] [MLLIB] Bugs related to FrequentItems when merging and with Tungsten
      
      In short:
      1- FrequentItems should not use the InternalRow representation, because the keys in the map get messed up. For example, when the elements are strings, every key in the Map corresponds to the very last element observed in the partition.
      
      2- Merging two partitions had a bug:
      
      **Existing behavior with size 3**
      Partition A -> Map(1 -> 3, 2 -> 3, 3 -> 4)
      Partition B -> Map(4 -> 25)
      Result -> Map()
      
      **Correct Behavior:**
      Partition A -> Map(1 -> 3, 2 -> 3, 3 -> 4)
      Partition B -> Map(4 -> 25)
      Result -> Map(3 -> 1, 4 -> 22)
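
The corrected result above is what a Misra-Gries-style counter produces. The Python sketch below is written to be consistent with that example; the function names and the exact eviction rule are assumptions and may not match the Scala implementation line for line.

```python
def add(counts, key, count, size):
    """Add `count` occurrences of `key` to a counter holding at most `size` keys."""
    if key in counts:
        counts[key] += count
    elif len(counts) < size:
        counts[key] = count
    else:
        min_count = min(counts.values())
        if count >= min_count:
            # Admit the new key, drop the keys at the minimum, and subtract
            # the minimum from everything that survives.
            counts[key] = count
            for k in [k for k, v in counts.items() if v <= min_count]:
                del counts[k]
            for k in counts:
                counts[k] -= min_count
        else:
            # The new key is too rare to be admitted; charge its count
            # against the existing keys instead.
            for k in counts:
                counts[k] -= count
    return counts


def merge(a, b, size):
    """Fold the counts of partition `b` into a copy of partition `a`."""
    merged = dict(a)
    for key, count in b.items():
        add(merged, key, count, size)
    return merged
```

Merging `{1: 3, 2: 3, 3: 4}` with `{4: 25}` at size 3 admits key 4, evicts keys 1 and 2 (both at the minimum of 3), and subtracts 3 from the rest.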
      
      cc mengxr rxin JoshRosen
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #7945 from brkyvz/freq-fix and squashes the following commits:
      
      07fa001 [Burak Yavuz] address 2
      1dc61a8 [Burak Yavuz] address 1
      506753e [Burak Yavuz] fixed and added reg test
      47bfd50 [Burak Yavuz] pushing
      98e69467
    • MechCoder's avatar
      [SPARK-9533] [PYSPARK] [ML] Add missing methods in Word2Vec ML · 076ec056
      MechCoder authored
      After https://github.com/apache/spark/pull/7263 it is pretty straightforward to add Python wrappers.
      
      Author: MechCoder <manojkumarsivaraj334@gmail.com>
      
      Closes #7930 from MechCoder/spark-9533 and squashes the following commits:
      
      1bea394 [MechCoder] make getVectors a lazy val
      5522756 [MechCoder] [SPARK-9533] [PySpark] [ML] Add missing methods in Word2Vec ML
      076ec056
    • MechCoder's avatar
      [SPARK-9112] [ML] Implement Stats for LogisticRegression · c5c6aded
      MechCoder authored
      I have added support for stats in LogisticRegression. The API is similar to that of LinearRegression with LogisticRegressionTrainingSummary and LogisticRegressionSummary
      
      I have some queries and asked them inline.
      
      Author: MechCoder <manojkumarsivaraj334@gmail.com>
      
      Closes #7538 from MechCoder/log_reg_stats and squashes the following commits:
      
      2e9f7c7 [MechCoder] Change defs into lazy vals
      d775371 [MechCoder] Clean up class inheritance
      9586125 [MechCoder] Add abstraction to handle Multiclass Metrics
      40ad8ef [MechCoder] minor
      640376a [MechCoder] remove unnecessary dataframe stuff and add docs
      80d9954 [MechCoder] Added tests
      fbed861 [MechCoder] DataFrame support for metrics
      70a0fc4 [MechCoder] [SPARK-9112] [ML] Implement Stats for LogisticRegression
      c5c6aded
    • Cheng Lian's avatar
      [SPARK-9593] [SQL] [HOTFIX] Makes the Hadoop shims loading fix more robust · 9f94c85f
      Cheng Lian authored
      This is a follow-up of #7929.
      
      We found that the Jenkins SBT master build still fails because of the Hadoop shims loading issue, but the failure doesn't appear to be deterministic. My suspicion is that the Hadoop `VersionInfo` class may fail to inspect the Hadoop version, so the shims loading branch is skipped.
      
      This PR tries to make the fix more robust:
      
      1. When Hadoop version is available, we load `Hadoop20SShims` for versions <= 2.0.x as srowen suggested in PR #7929.
      2. Otherwise, we use `Path.getPathWithoutSchemeAndAuthority` as a probe method, which doesn't exist in Hadoop 1.x or 2.0.x. If this method is not found, `Hadoop20SShims` is also loaded.
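
The two-step decision can be sketched in Python. The function name and the version parsing are hypothetical, and `hasattr` stands in for the reflective method lookup that the real (JVM) code would perform.

```python
def needs_hadoop20s_shims(hadoop_version, path_class):
    """Return True when the (hypothetical) loader should pick Hadoop20SShims."""
    if hadoop_version:
        # Step 1: the version is known, so compare (major, minor) with 2.0.
        major, minor = (int(part) for part in hadoop_version.split(".")[:2])
        return (major, minor) <= (2, 0)
    # Step 2: no version available, so probe for a method that is absent
    # from Hadoop 1.x and 2.0.x.
    return not hasattr(path_class, "getPathWithoutSchemeAndAuthority")
```

The probe only runs when the version is unavailable, which is the case the follow-up is meant to cover.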
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #7994 from liancheng/spark-9593/fix-hadoop-shims and squashes the following commits:
      
      e1d3d70 [Cheng Lian] Fixes typo in comments
      8d971da [Cheng Lian] Makes the Hadoop shims loading fix more robust
      9f94c85f
    • Davies Liu's avatar
      [SPARK-9482] [SQL] Fix thread-safety issue of using UnsafeProjection in join · 93085c99
      Davies Liu authored
      This PR also changes UnsafeProjection to use `def` instead of `lazy val`, because it's not thread safe.
      
      TODO: clean up the debug code once the flaky test has passed 100 times.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #7940 from davies/semijoin and squashes the following commits:
      
      93baac7 [Davies Liu] fix outerjoin
      5c40ded [Davies Liu] address comments
      aa3de46 [Davies Liu] Merge branch 'master' of github.com:apache/spark into semijoin
      7590a25 [Davies Liu] Merge branch 'master' of github.com:apache/spark into semijoin
      2d4085b [Davies Liu] use def for resultProjection
      0833407 [Davies Liu] Merge branch 'semijoin' of github.com:davies/spark into semijoin
      e0d8c71 [Davies Liu] use lazy val
      6a59e8f [Davies Liu] Update HashedRelation.scala
      0fdacaf [Davies Liu] fix broadcast and thread-safety of UnsafeProjection
      2fc3ef6 [Davies Liu] reproduce failure in semijoin
      93085c99
    • Davies Liu's avatar
      [SPARK-9644] [SQL] Support update DecimalType with precision > 18 in UnsafeRow · 5b965d64
      Davies Liu authored
      To support updating a variable-length (actually fixed-length) object, its space should be preserved even when it's null. We can no longer call setNullAt(i) for it, because setNullAt(i) would remove the offset of the preserved space; we should call setDecimal(i, null, precision) instead.
      
      After this, we can do hash-based aggregation on DecimalType with precision > 18. In one test, this decreased the end-to-end run time of an aggregation query from 37 seconds (sort based) to 24 seconds (hash based).
      
      cc rxin
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #7978 from davies/update_decimal and squashes the following commits:
      
      bed8100 [Davies Liu] isSettable -> isMutable
      923c9eb [Davies Liu] address comments and fix bug
      385891d [Davies Liu] Merge branch 'master' of github.com:apache/spark into update_decimal
      36a1872 [Davies Liu] fix tests
      cd6c524 [Davies Liu] support set decimal with precision > 18
      5b965d64