  1. Apr 21, 2016
• [SPARK-14798][SQL] Move native command and script transformation parsing into SparkSqlAstBuilder · 1a95397b
      Reynold Xin authored
      ## What changes were proposed in this pull request?
This patch moves native command and script transformation parsing into SparkSqlAstBuilder. This builds on #12561; see the last commit for the diff.
      
      ## How was this patch tested?
      Updated test cases to reflect this.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #12564 from rxin/SPARK-14798.
• [MINOR] Comment whitespace changes in #12553 · ef6be7be
      Andrew Or authored
• [SPARK-13643][SQL] Implement SparkSession · a2e8d4fd
      Andrew Or authored
      ## What changes were proposed in this pull request?
      
      After removing most of `HiveContext` in 8fc267ab we can now move existing functionality in `SQLContext` to `SparkSession`. As of this PR `SQLContext` becomes a simple wrapper that has a `SparkSession` and delegates all functionality to it.
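
The wrapper shape this describes, as a minimal sketch (class and method names are simplified stand-ins, not the actual internals):

```
object SessionWrapperSketch {
  class DataFrame // placeholder for the real type

  class SparkSession {
    def sql(text: String): DataFrame = new DataFrame // parse and plan here
  }

  // SQLContext keeps its public surface but forwards all work.
  class SQLContext(val sparkSession: SparkSession) {
    def sql(text: String): DataFrame = sparkSession.sql(text)
  }
}
```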
      
      ## How was this patch tested?
      
      Jenkins.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #12553 from andrewor14/implement-spark-session.
• [SPARK-14801][SQL] Move MetastoreRelation to its own file · 8e1bb045
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This class is currently in HiveMetastoreCatalog.scala, which is a large file that makes refactoring and searching of usage difficult. Moving it out so I can then do SPARK-14799 and make the review of that simpler.
      
      ## How was this patch tested?
      N/A - this is a straightforward move and should be covered by existing tests.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #12567 from rxin/SPARK-14801.
• [SPARK-14699][CORE] Stop endpoints before closing the connections and don't stop client in Outbox · e4904d87
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      In general, `onDisconnected` is for dealing with unexpected network disconnections. When RpcEnv.shutdown is called, the disconnections are expected so RpcEnv should not fire these events.
      
      This PR moves `dispatcher.stop()` above closing the connections so that when stopping RpcEnv, the endpoints won't receive `onDisconnected` events.
      
      In addition, Outbox should not close the client since it will be reused by others. This PR fixes it as well.
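
A minimal sketch of the resulting shutdown ordering (type and method names are illustrative, not the actual NettyRpcEnv internals):

```
trait Dispatcher { def stop(): Unit }
trait Transport  { def close(): Unit }

class RpcEnvShutdownSketch(dispatcher: Dispatcher, transport: Transport) {
  def shutdown(): Unit = {
    dispatcher.stop()  // 1) endpoints stop receiving events, incl. onDisconnected
    transport.close()  // 2) only then tear down the network connections
  }
}
```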
      
      ## How was this patch tested?
      
      test("SPARK-14699: RpcEnv.shutdown should not fire onDisconnected events")
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #12481 from zsxwing/SPARK-14699.
• [SPARK-14795][SQL] Remove the use of Hive's variable substitution · 3a21e8d5
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch builds on #12556 and completely removes the use of Hive's variable substitution.
      
      ## How was this patch tested?
      Covered by existing tests.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #12561 from rxin/SPARK-14795.
• [SPARK-14799][SQL] Remove MetastoreRelation dependency from AnalyzeTable - part 1 · 79008e6c
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch isolates AnalyzeTable's dependency on MetastoreRelation into a single line. After this we can work on converging MetastoreRelation and CatalogTable.
      
      ## How was this patch tested?
      Covered by existing tests.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #12566 from rxin/SPARK-14799.
• [SPARK-14783] Preserve full exception stacktrace in IsolatedClientLoader · a70d4031
      Josh Rosen authored
In IsolatedClientLoader, we have a `catch` block which throws a new exception without wrapping the original one, causing the full exception stacktrace and any nested exceptions to be lost. This patch fixes this, improving the usefulness of classloading error messages.
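
The shape of the fix, as a hedged sketch (names illustrative; the real code is in IsolatedClientLoader):

```
object WrapCauseSketch {
  def runIsolated(body: () => Unit): Unit =
    try body() catch {
      case e: Exception =>
        // Passing `e` as the cause preserves the full stacktrace and any
        // nested exceptions instead of discarding them.
        throw new RuntimeException("failed to load isolated client", e)
    }
}
```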
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #12548 from JoshRosen/improve-logging-for-hive-classloader-issues.
• [SPARK-4452] [CORE] Shuffle data structures can starve others on the same thread for memory · 4f369176
      Lianhui Wang authored
      ## What changes were proposed in this pull request?
#9241 implemented a mechanism to call spill() on those SQL operators that support spilling when there is not enough memory for execution.
But ExternalSorter and AppendOnlyMap in Spark core did not benefit from it. So this PR makes them benefit from #9241: when there is not enough memory for execution, memory can now be reclaimed by spilling ExternalSorter and AppendOnlyMap in Spark core.
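
A rough sketch of the cooperative-spilling idea under simplified assumptions (the real logic lives in Spark's memory manager and its consumer hierarchy):

```
// When an acquire cannot be satisfied, ask other registered consumers
// (e.g. ExternalSorter, AppendOnlyMap) to spill and free memory.
abstract class SpillableSketch {
  def spill(): Long // release memory, return bytes freed
}

class MemoryPoolSketch(private var free: Long) {
  private var consumers = List.empty[SpillableSketch]
  def register(c: SpillableSketch): Unit = consumers ::= c

  def acquire(bytes: Long): Boolean = {
    val it = consumers.iterator
    while (free < bytes && it.hasNext) free += it.next().spill()
    if (free >= bytes) { free -= bytes; true } else false
  }
}
```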
      
      ## How was this patch tested?
Added two unit tests for it.
      
      Author: Lianhui Wang <lianhuiwang09@gmail.com>
      
      Closes #10024 from lianhuiwang/SPARK-4452-2.
• [SPARK-14797][BUILD] Spark SQL POM should not hardcode spark-sketch_2.11 dep. · 649335d6
      Josh Rosen authored
      Spark SQL's POM hardcodes a dependency on `spark-sketch_2.11`, which causes Scala 2.10 builds to include the `_2.11` dependency. This is harmless since `spark-sketch` is a pure-Java module (see #12334 for a discussion of dropping the Scala version suffixes from these modules' artifactIds), but it's confusing to people looking at the published POMs.
      
      This patch fixes this by using `${scala.binary.version}` to substitute the correct suffix, and also adds a set of Maven Enforcer rules to ensure that `_2.11` artifacts are not used in 2.10 builds (and vice-versa).
      
      /cc ahirreddy, who spotted this issue.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #12563 from JoshRosen/fix-sketch-scala-version.
• [SPARK-13988][CORE] Make replaying event logs multi threaded in History... · 6fdd0e32
      Parth Brahmbhatt authored
[SPARK-13988][CORE] Make replaying event logs multi threaded in History server to ensure a single large log does not block other logs from being rendered.
      
      ## What changes were proposed in this pull request?
      The patch makes event log processing multi threaded.
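
A hedged sketch of the idea (pool size and names illustrative):

```
import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}

object ReplaySketch {
  implicit val ec: ExecutionContext =
    ExecutionContext.fromExecutor(Executors.newFixedThreadPool(4))

  // Replay each event log as its own task so one huge log cannot block
  // smaller ones from being rendered.
  def replayAll(logPaths: Seq[String])(replay: String => Unit): Seq[Future[Unit]] =
    logPaths.map(path => Future(replay(path)))
}
```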
      
      ## How was this patch tested?
Existing tests pass; no new tests are needed since this is a perf improvement. I tested the patch locally by generating one big event log (big1), one small event log (small1), and another big event log (big2). Without this patch the UI does not render any app for almost 30 seconds, after which big2 and small1 appear; after another 30-second delay, big1 finally shows up. With this change, small1 shows up immediately and big1 and big2 come up in 30 seconds. Locally it also displays them in the correct order in the UI.
      
      Author: Parth Brahmbhatt <pbrahmbhatt@netflix.com>
      
      Closes #11800 from Parth-Brahmbhatt/SPARK-13988.
• [HOTFIX] Remove wrong DDL tests · 4ac6e75c
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
As we moved most parsing rules to `SparkSqlParser`, some tests that expect an exception to be thrown are no longer correct.
      
      ## How was this patch tested?
      `DDLCommandSuite`
      
      Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
      
      Closes #12572 from viirya/hotfix-ddl.
• [SPARK-14779][CORE] Corrected log message in Worker case KillExecutor · d53a51c1
      Bryan Cutler authored
In o.a.s.deploy.worker.Worker.scala, when receiving a KillExecutor message from an invalid Master, fixed a typo by changing the log message to read "..attempted to kill executor.."
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      
      Closes #12546 from BryanCutler/worker-killexecutor-log-message.
• [SPARK-14787][SQL] Upgrade Joda-Time library from 2.9 to 2.9.3 · ec2a2760
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      https://issues.apache.org/jira/browse/SPARK-14787
      
The possible problems are described in the JIRA above. Please refer to it if you are wondering about the purpose of this PR.
      
      This PR upgrades Joda-Time library from 2.9 to 2.9.3.
      
      ## How was this patch tested?
      
      `sbt scalastyle` and Jenkins tests in this PR.
      
      closes #11847
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #12552 from HyukjinKwon/SPARK-14787.
• [SPARK-14739][PYSPARK] Fix Vectors parser bugs · 2b8906c4
      Arash Parsa authored
      ## What changes were proposed in this pull request?
      
The PySpark deserialization has a bug that shows up while deserializing all-zero sparse vectors. This fix filters out empty string tokens before casting, so properly stringified SparseVectors are parsed successfully.
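
The essence of the token filtering, sketched in Scala (the actual change is in PySpark's vector string parser):

```
object SparseParseSketch {
  // An all-zero sparse vector stringifies with no entries, leaving empty
  // tokens after splitting; filter them out before casting to Double.
  def parseValues(csv: String): Array[Double] =
    csv.split(",").filter(_.trim.nonEmpty).map(_.trim.toDouble)
}
```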
      
      ## How was this patch tested?
      
      Standard unit-tests similar to other methods.
      
      Author: Arash Parsa <arash@ip-192-168-50-106.ec2.internal>
      Author: Arash Parsa <arashpa@gmail.com>
      Author: Vishnu Prasad <vishnu667@gmail.com>
      Author: Vishnu Prasad S <vishnu667@gmail.com>
      
      Closes #12516 from arashpa/SPARK-14739.
• [SPARK-8393][STREAMING] JavaStreamingContext#awaitTermination() throws... · 8bd05c9d
      Sean Owen authored
      [SPARK-8393][STREAMING] JavaStreamingContext#awaitTermination() throws non-declared InterruptedException
      
      ## What changes were proposed in this pull request?
      
      `JavaStreamingContext.awaitTermination` methods should be declared as `throws[InterruptedException]` so that this exception can be handled in Java code. Note this is not just a doc change, but an API change, since now (in Java) the method has a checked exception to handle. All await-like methods in Java APIs behave this way, so seems worthwhile for 2.0.
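
The change is essentially an annotation of this shape (sketch; the real signatures live in JavaStreamingContext):

```
class AwaitSketch {
  // @throws makes the checked exception visible to Java callers,
  // matching other await-like methods in Java APIs.
  @throws(classOf[InterruptedException])
  def awaitTermination(): Unit = { /* block until the context stops */ }
}
```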
      
      ## How was this patch tested?
      
      Jenkins tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #12418 from srowen/SPARK-8393.
• [SPARK-14753][CORE] remove internal flag in Accumulable · cb51680d
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
the `Accumulable.internal` flag is only used to avoid registering internal accumulators in two cases:

1. `TaskMetrics.createTempShuffleReadMetrics`: the accumulators in the temp shuffle read metrics should not be registered.
2. `TaskMetrics.fromAccumulatorUpdates`: the created task metrics are only used to post events, so the accumulators inside them should not be registered.

For 1, we can create a `TempShuffleReadMetrics` that doesn't create accumulators and just keeps the data, merging it at the end (see the sketch below).
For 2, we can un-register these accumulators immediately.
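
A minimal sketch of idea (1), with illustrative field names:

```
// Plain holders accumulate values locally with no accumulator registration;
// they are merged into the real metrics once, at the end of the task.
class TempShuffleReadMetricsSketch {
  var recordsRead = 0L
  var bytesRead = 0L
}

class ShuffleReadMetricsSketch {
  var recordsRead = 0L
  var bytesRead = 0L
  def merge(temps: Seq[TempShuffleReadMetricsSketch]): Unit = {
    recordsRead += temps.map(_.recordsRead).sum
    bytesRead += temps.map(_.bytesRead).sum
  }
}
```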
      
      TODO: remove `internal` flag in `AccumulableInfo` with followup PR
      
      ## How was this patch tested?
      
      existing tests.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #12525 from cloud-fan/acc.
• [SPARK-14794][SQL] Don't pass analyze command into Hive · 228128ce
      Reynold Xin authored
      ## What changes were proposed in this pull request?
We shouldn't pass the analyze command to Hive because some variants would require running MapReduce jobs. For now, let's just always run the no-scan analyze.
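
What the always-no-scan behavior amounts to, sketched against the public API (table name hypothetical):

```
import org.apache.spark.sql.SparkSession

object AnalyzeNoScanSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("analyze-sketch").getOrCreate()
    // NOSCAN collects only cheap statistics such as total size, so no
    // MapReduce job is ever launched on the Hive side.
    spark.sql("ANALYZE TABLE my_table COMPUTE STATISTICS NOSCAN")
    spark.stop()
  }
}
```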
      
      ## How was this patch tested?
      Updated test case to reflect this change.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #12558 from rxin/parser-analyze.
• [HOTFIX] Disable flaky tests · 3b9fd517
      Reynold Xin authored
• [SPARK-14792][SQL] Move as many parsing rules as possible into SQL parser · 77d847dd
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch moves as many parsing rules as possible into SQL parser. There are only three more left after this patch: (1) run native command, (2) analyze, and (3) script IO. These 3 will be dealt with in a follow-up PR.
      
      ## How was this patch tested?
      No test change. This simply moves code around.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #12556 from rxin/SPARK-14792.
• [SPARK-14786] Remove hive-cli dependency from hive subproject · cfe472a3
      Josh Rosen authored
      The `hive` subproject currently depends on `hive-cli` in order to perform a check to see whether a `SessionState` is an instance of `org.apache.hadoop.hive.cli.CliSessionState` (see #9589). The introduction of this `hive-cli` dependency has caused problems for users whose Hive metastore JAR classpaths don't include the `hive-cli` classes (such as in #11495).
      
      This patch removes this dependency on `hive-cli` and replaces the `isInstanceOf` check by reflection. I added a Maven Enforcer rule to ban `hive-cli` from the `hive` subproject in order to make sure that this dependency is not accidentally reintroduced.
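
A hedged sketch of a reflection-based replacement for the `isInstanceOf` check (the actual code may differ):

```
object CliSessionCheckSketch {
  // Walk the class hierarchy by name so hive-cli classes are not needed
  // on the compile-time classpath.
  def isCliSessionState(state: AnyRef): Boolean = {
    var clazz: Class[_] = state.getClass
    while (clazz != null) {
      if (clazz.getName == "org.apache.hadoop.hive.cli.CliSessionState") {
        return true
      }
      clazz = clazz.getSuperclass
    }
    false
  }
}
```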
      
      /cc rxin yhuai adrian-wang preecet
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #12551 from JoshRosen/remove-hive-cli-dep-from-hive-subproject.
  2. Apr 20, 2016
• [SPARK-14782][SPARK-14778][SQL] Remove HiveConf dependency from HiveSqlAstBuilder · 80458141
      Reynold Xin authored
      ## What changes were proposed in this pull request?
This patch removes the HiveConf dependency from HiveSqlAstBuilder. This is required in order to merge HiveSqlParser and SparkSqlAstBuilder, which in turn requires getting rid of the Hive-specific dependencies in HiveSqlParser.
      
      This patch also accomplishes [SPARK-14778] Remove HiveSessionState.substitutor.
      
      ## How was this patch tested?
      This should be covered by existing tests.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #12550 from rxin/SPARK-14782.
• [HOTFIX] Ignore all Docker integration tests · 90933e2a
      Josh Rosen authored
      The Docker integration tests are failing very often (https://spark-tests.appspot.com/failed-tests) so I think we should disable these suites for now until we have time to improve them.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #12549 from JoshRosen/ignore-all-docker-tests.
• [SPARK-14775][SQL] Remove TestHiveSparkSession.rewritePaths · 24f338ba
      Reynold Xin authored
      ## What changes were proposed in this pull request?
The path rewrite in TestHiveSparkSession is pretty hacky. I think we can remove that complexity and just do a string replacement when we read the query files in. This would remove the overloading of runNativeSql in TestHive, which will simplify the removal of Hive-specific variable substitution.
      
      ## How was this patch tested?
      This is a small test refactoring to simplify test infrastructure.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #12543 from rxin/SPARK-14775.
• [SPARK-14602][YARN] Use SparkConf to propagate the list of cached files. · f47dbf27
      Marcelo Vanzin authored
      This change avoids using the environment to pass this information, since
      with many jars it's easy to hit limits on certain OSes. Instead, it encodes
      the information into the Spark configuration propagated to the AM.
      
      The first problem that needed to be solved is a chicken & egg issue: the
      config file is distributed using the cache, and it needs to contain information
      about the files that are being distributed. To solve that, the code now treats
the config archive specially, and uses slightly different code to distribute
      it, so that only its cache path needs to be saved to the config file.
      
      The second problem is that the extra information would show up in the Web UI,
      which made the environment tab even more noisy than it already is when lots
      of jars are listed. This is solved by two changes: the list of cached files
      is now read only once in the AM, and propagated down to the ExecutorRunnable
      code (which actually sends the list to the NMs when starting containers). The
      second change is to unset those config entries after the list is read, so that
      the SparkContext never sees them.
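
A minimal sketch of that flow, using a hypothetical config key (not the key the patch actually uses):

```
import org.apache.spark.SparkConf

object CachedFileListSketch {
  val Key = "spark.yarn.cachedFiles.SKETCH" // hypothetical key name

  def encode(conf: SparkConf, files: Seq[String]): Unit =
    conf.set(Key, files.mkString(","))

  // Read once in the AM, then unset so the SparkContext (and the UI's
  // environment tab) never sees the entry.
  def readAndClear(conf: SparkConf): Seq[String] = {
    val files = conf.get(Key, "").split(",").filter(_.nonEmpty).toSeq
    conf.remove(Key)
    files
  }
}
```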
      
      Tested with both client and cluster mode by running "run-example SparkPi". This
      uploads a whole lot of files when run from a build dir (instead of a distribution,
      where the list is cleaned up), and I verified that the configs do not show
      up in the UI.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #12487 from vanzin/SPARK-14602.
• [SPARK-14769][SQL] Create built-in functionality for variable substitution · 334c293e
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      In order to fully merge the Hive parser and the SQL parser, we'd need to support variable substitution in Spark. The implementation of the substitute algorithm is mostly copied from Hive, but I simplified the overall structure quite a bit and added more comprehensive test coverage.
      
      Note that this pull request does not yet use this functionality anywhere.
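
A minimal sketch of the substitution idea, assuming a simple `${name}` syntax (the real implementation also resolves system/env prefixes and bounds nesting depth):

```
import scala.util.matching.Regex

object SubstituteSketch {
  private val Var = """\$\{([^}]+)\}""".r

  def substitute(sql: String, vars: Map[String, String]): String =
    Var.replaceAllIn(sql, m =>
      Regex.quoteReplacement(vars.getOrElse(m.group(1), m.matched)))
}
// SubstituteSketch.substitute("SELECT * FROM ${tbl}", Map("tbl" -> "t"))
// returns "SELECT * FROM t"
```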
      
      ## How was this patch tested?
      Added VariableSubstitutionSuite for unit tests.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #12538 from rxin/SPARK-14769.
• [SPARK-14770][SQL] Remove unused queries in hive module test resources · b28fe448
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      We currently have five folders in queries: clientcompare, clientnegative, clientpositive, negative, and positive. Only clientpositive is used. We can remove the rest.
      
      ## How was this patch tested?
      N/A - removing unused test resources.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #12540 from rxin/SPARK-14770.
• [SPARK-14749][SQL, TESTS] PlannerSuite failed when it run individually · fd826819
      Subhobrata Dey authored
      ## What changes were proposed in this pull request?
      
      3 testcases namely,
      
      ```
      "count is partially aggregated"
      "count distinct is partially aggregated"
      "mixed aggregates are partially aggregated"
      ```
      
      were failing when running PlannerSuite individually.
      The PR provides a fix for this.
      
      ## How was this patch tested?
      
      unit tests
      
      
      Author: Subhobrata Dey <sbcd90@gmail.com>
      
      Closes #12532 from sbcd90/plannersuitetestsfix.
• [SPARK-13842] [PYSPARK] pyspark.sql.types.StructType accessor enhancements · e7791c4f
      Sheamus K. Parkes authored
      ## What changes were proposed in this pull request?
      
      Expand the possible ways to interact with the contents of a `pyspark.sql.types.StructType` instance.
        - Iterating a `StructType` will iterate its fields
          - `[field.name for field in my_structtype]`
        - Indexing with a string will return a field by name
          - `my_structtype['my_field_name']`
        - Indexing with an integer will return a field by position
          - `my_structtype[0]`
- Indexing with a slice will return a new `StructType` with just the chosen fields
          - `my_structtype[1:3]`
        - The length is the number of fields (should also provide "truthiness" for free)
          - `len(my_structtype) == 2`
      
      ## How was this patch tested?
      
      Extended the unit test coverage in the accompanying `tests.py`.
      
      Author: Sheamus K. Parkes <shea.parkes@milliman.com>
      
      Closes #12251 from skparkes/pyspark-structtype-enhance.
• [SPARK-14678][SQL] Add a file sink log to support versioning and compaction · 7bc94855
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      This PR adds a special log for FileStreamSink for two purposes:
      
      - Versioning. A future Spark version should be able to read the metadata of an old FileStreamSink.
      - Compaction. As reading from many small files is usually pretty slow, we should compact small metadata files into big files.
      
FileStreamSinkLog uses a new log format instead of Java serialization. It writes one log file for each batch: the first line of the log file is the version number, followed by multiple JSON lines, each of which is the JSON serialization of a FileLog entry.

FileStreamSinkLog will compact log files every "spark.sql.sink.file.log.compactLen" batches into a big file. When doing a compaction, it reads all history logs and merges them with the new batch. During the compaction it also drops entries for files that have been deleted (marked by FileLog.action). When the reader uses allLogs to list all files, it returns only the visible files (dropping the deleted ones).
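
A hedged sketch of the on-disk shape described above (names illustrative, not the actual FileStreamSinkLog code):

```
object SinkLogFormatSketch {
  val Version = "v1"

  // First line: version; each following line: one JSON record per file action.
  def serialize(jsonLines: Seq[String]): String =
    (Version +: jsonLines).mkString("\n")

  def deserialize(content: String): (String, Seq[String]) = {
    val lines = content.split("\n").toSeq
    (lines.head, lines.tail) // (version, JSON records)
  }
}
```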
      
      ## How was this patch tested?
      
      FileStreamSinkLogSuite
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #12435 from zsxwing/sink-log.
• [MINOR][ML][PYSPARK] Fix omissive params which should use TypeConverter · 296c384a
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
#11663 adds type conversion functionality for parameters in PySpark. This PR finds the omitted `Param`s that did not pass the corresponding `TypeConverter` argument and fixes them. After this PR, all params in pyspark/ml/ use `TypeConverter`.
      
      ## How was this patch tested?
      Existing tests.
      
      cc jkbradley sethah
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #12529 from yanboliang/typeConverter.
• [SPARK-14720][SPARK-13643] Move Hive-specific methods into HiveSessionState... · 8fc267ab
      Andrew Or authored
      [SPARK-14720][SPARK-13643] Move Hive-specific methods into HiveSessionState and Create a SparkSession class
      
      ## What changes were proposed in this pull request?
      This PR has two main changes.
1. Move Hive-specific methods from HiveContext to HiveSessionState, which helps the work of removing HiveContext.
2. Create a SparkSession class, which will later be the entry point for Spark SQL users.
      
      ## How was this patch tested?
      Existing tests
      
      This PR is trying to fix test failures of https://github.com/apache/spark/pull/12485.
      
      Author: Andrew Or <andrew@databricks.com>
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #12522 from yhuai/spark-session.
• [SPARK-14741][SQL] Fixed error in reading json file stream inside a partitioned directory · cb8ea9e1
      Tathagata Das authored
      ## What changes were proposed in this pull request?
      
Consider the following directory structure:
dir/col=X/some-files
If we create a text-format streaming dataframe on `dir/col=X/`, it should not treat `col` as a partitioning column. Even though the streaming dataframe does not do so, the generated batch dataframes pick up `col` as a partitioning column, causing a mismatch between the streaming source schema and the generated df schema. This leads to a runtime failure:
      ```
      18:55:11.262 ERROR org.apache.spark.sql.execution.streaming.StreamExecution: Query query-0 terminated with error
      java.lang.AssertionError: assertion failed: Invalid batch: c#2 != c#7,type#8
      ```
The reason is that the partition-inferring code has no notion of a base path above which it should not search for partitions. This PR makes sure that the batch DF is generated with basePath set to the original path on which the file stream source is defined.
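
The effect can be pictured with the reader's `basePath` option, which pins partition discovery (sketch using the directory from the example above):

```
import org.apache.spark.sql.SparkSession

object BasePathSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("basepath-sketch").getOrCreate()
    // With basePath pinned at the source directory, partition inference
    // does not climb above it, so `col` is not treated as a partition.
    val df = spark.read.option("basePath", "dir/col=X").text("dir/col=X")
    df.printSchema() // only the text `value` column, no `col`
    spark.stop()
  }
}
```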
      
      ## How was this patch tested?
      
      New unit test
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #12517 from tdas/SPARK-14741.
• [SPARK-14478][ML][MLLIB][DOC] Doc that StandardScaler uses the corrected sample std · acc7e592
      Joseph K. Bradley authored
      ## What changes were proposed in this pull request?
      
      Currently, MLlib's StandardScaler scales columns using the corrected standard deviation (sqrt of unbiased variance). This matches what R's scale package does.
      
      This PR documents this fact.
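
For reference, the corrected sample standard deviation being documented is:

```
s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}
```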
      
      ## How was this patch tested?
      
      doc only
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #12519 from jkbradley/scaler-variance-doc.
• [MINOR][ML][PYSPARK] Fix omissive param setters which should use _set method · 08f84d7a
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
#11939 made Python param setters use the `_set` method. This PR fixes the omitted ones.
      
      ## How was this patch tested?
      Existing tests.
      
      cc jkbradley sethah
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #12531 from yanboliang/setters-omissive.
• [SPARK-14725][CORE] Remove HttpServer class · 90cbc82f
      jerryshao authored
      ## What changes were proposed in this pull request?
      
This proposal removes the `HttpServer` class. With internal file/jar/class transmission moved to the RPC layer, there is no longer any code using `HttpServer`, so this proposes removing it.
      
      ## How was this patch tested?
      
Unit tests were verified locally.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #12526 from jerryshao/SPARK-14725.
• [SPARK-14742][DOCS] Redirect spark-ec2 doc to new location · b4e76a9a
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Restore `ec2-scripts.md` as a redirect to amplab/spark-ec2 docs
      
      ## How was this patch tested?
      
      `jekyll build` and checked with the browser
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #12534 from srowen/SPARK-14742.
• [SPARK-14555] First cut of Python API for Structured Streaming · 80bf48f4
      Burak Yavuz authored
      ## What changes were proposed in this pull request?
      
This patch provides a first cut of Python APIs for structured streaming. This PR provides the new classes:
       - ContinuousQuery
       - Trigger
       - ProcessingTime
      in pyspark under `pyspark.sql.streaming`.
      
      In addition, it contains the new methods added under:
       -  `DataFrameWriter`
           a) `startStream`
           b) `trigger`
           c) `queryName`
      
       -  `DataFrameReader`
           a) `stream`
      
       - `DataFrame`
          a) `isStreaming`
      
      This PR doesn't contain all methods exposed for `ContinuousQuery`, for example:
       - `exception`
       - `sourceStatuses`
       - `sinkStatus`
      
      They may be added in a follow up.
      
      This PR also contains some very minor doc fixes in the Scala side.
      
      ## How was this patch tested?
      
      Python doc tests
      
      TODO:
       - [ ] verify Python docs look good
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      Author: Burak Yavuz <burak@databricks.com>
      
      Closes #12320 from brkyvz/stream-python.
• [SPARK-8171][WEB UI] Javascript based infinite scrolling for the log page · 83427788
      Alex Bozarth authored
Updated the log page by replacing the current pagination with a JavaScript-based infinite scroll solution.
      
      Author: Alex Bozarth <ajbozart@us.ibm.com>
      
      Closes #10910 from ajbozarth/spark8171.
• [SPARK-14635][ML] Documentation and Examples for TF-IDF only refer to HashingTF · ed9d8038
      Yuhao Yang authored
      ## What changes were proposed in this pull request?
      
      Currently, the docs for TF-IDF only refer to using HashingTF with IDF. However, CountVectorizer can also be used. We should probably amend the user guide and examples to show this.
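
A sketch of the CountVectorizer-based alternative the doc change points to (assumes a DataFrame with a tokenized "words" column):

```
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.{CountVectorizer, IDF}

object TfIdfWithCountVectorizerSketch {
  // CountVectorizer builds an exact, vocabulary-based term-frequency vector
  // where HashingTF would hash terms into a fixed-size feature space.
  val cv  = new CountVectorizer().setInputCol("words").setOutputCol("tf")
  val idf = new IDF().setInputCol("tf").setOutputCol("features")
  val pipeline = new Pipeline().setStages(Array[PipelineStage](cv, idf))
  // pipeline.fit(docs).transform(docs) would yield TF-IDF features.
}
```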
      
      ## How was this patch tested?
      
      unit tests and doc generation
      
      Author: Yuhao Yang <hhbyyh@gmail.com>
      
      Closes #12454 from hhbyyh/tfdoc.