Comments (14)

sfc-gh-japatel commented on September 22, 2024

Thanks. A few questions:

  1. Would you be willing to share some logs from around the time you saw the duplicates?
  2. Where did you see the duplicates? In the Kafka table on the Snowflake side?

from snowflake-kafka-connector.

imrohankataria commented on September 22, 2024

What exactly would you like to see? I can grep the logs and get you the exact lines.

imrohankataria commented on September 22, 2024

The duplicates are in the Kafka tables in Snowflake (where the connector dumps the data). @sfc-gh-japatel

sfc-gh-japatel commented on September 22, 2024

Which version of SF KC are you using?
As for the logs, we want to see whether there was any exception or unusual scenario, to help us better understand where and how the duplicates occur.

imrohankataria commented on September 22, 2024

The entire log would be huge, but if you can tell me exactly what needs to be seen in the logs, I can grep for it and get you exactly what you need.
Version: 2.0.0

sfc-gh-japatel commented on September 22, 2024

A few more questions/asks:

  1. Can you use the latest version? https://github.com/snowflakedb/snowflake-kafka-connector/releases/tag/v2.0.1
  2. Can you repro this issue every time?
  3. Please share logs from around the time you see duplicate candidates in your table in SF. I want to see where, and because of which exception, this is occurring.
  4. I also see 38 topics in your single connector. Do you have any conflicts in terms of topic2tablemap, i.e. do they all go to the same table or to different tables?
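
One way to answer the last question is to parse the connector's topic-to-table mapping and list every table that is fed by more than one topic. The sketch below assumes the mapping is a comma-separated `topic:table` string, as in the `snowflake.topic2table.map` setting; the topic and table names are hypothetical, and the helper functions are illustrative, not part of the connector.

```python
from collections import defaultdict


def parse_topic2table_map(raw: str) -> dict[str, str]:
    """Parse a 'topic1:table1,topic2:table2' string into {topic: table}."""
    mapping = {}
    for pair in raw.split(","):
        pair = pair.strip()
        if not pair:
            continue
        topic, _, table = pair.partition(":")
        mapping[topic.strip()] = table.strip()
    return mapping


def tables_with_multiple_topics(mapping: dict[str, str]) -> dict[str, list[str]]:
    """Return {table: [topics]} for every table fed by more than one topic."""
    topics_by_table = defaultdict(list)
    for topic, table in mapping.items():
        # Assumption: table names are unquoted identifiers, so compare them
        # case-insensitively by upper-casing.
        topics_by_table[table.upper()].append(topic)
    return {
        table: sorted(topics)
        for table, topics in topics_by_table.items()
        if len(topics) > 1
    }


# Hypothetical topic and table names, for illustration only.
raw_map = "orders:ORDERS_RAW, orders_v2:orders_raw, payments:PAYMENTS_RAW"
conflicts = tables_with_multiple_topics(parse_topic2table_map(raw_map))
print(conflicts)  # {'ORDERS_RAW': ['orders', 'orders_v2']}
```

Several topics sharing one table is not a problem in itself, but it is worth knowing before counting rows per table.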

imrohankataria commented on September 22, 2024

One thing I want to mention:

I reload data a lot, and the way I do it is to recreate the sink connector with a new name, so it loads from the beginning of the topic.
I don't change the base table name in Snowflake, so the data always goes into the same table, from the same Kafka topic, but through a sink connector with a different name.

sfc-gh-japatel commented on September 22, 2024

I see. This might well be one of the reasons why you see duplicates. Can you always choose a new table in Snowflake?

Also, 2.0.0 had a bug, fixed in 2.0.1, that could cause missing data if you have a lot of rebalances.
We haven't seen a duplicate issue, but what you mention could be the reason.

imrohankataria commented on September 22, 2024

Again, the way I have set this up, there are streams on top of the tables.
The issue is what I'm seeing on the Kafka broker/Eventhub: data going out twice.

imrohankataria commented on September 22, 2024

I hope it doesn't have anything to do with offset resetting.
Also, if there is a network error, shouldn't it check the offset value before reloading data? @sfc-gh-japatel

sfc-gh-japatel commented on September 22, 2024

We do that today! Please verify from the Snowflake side that an offset was ingested twice into your landing table. The metadata column can be used for that, and you can run some queries to identify whether you have duplicates.

But if you recreate the connector with a new name, you are bound to get this issue, since a new connector name means a new consumer group, and it will start from offset 0.
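
To illustrate why a new connector name leads to a full reload, here is a toy model of consumer-group offsets. It is not the connector's code. It assumes the consumer group is derived from the connector name and that an unknown group starts at offset 0; the group names and events are made up.

```python
# Toy model of Kafka consumer-group offsets writing into one shared table.
topic_log = [f"event-{i}" for i in range(5)]  # one partition, offsets 0..4

committed = {}      # consumer group -> next offset to read
landing_table = []  # rows as (offset, value), standing in for the Snowflake table


def run_sink(group: str) -> int:
    """Consume from the group's committed offset to the end; return rows written."""
    # Assumption: a group with no committed offset starts from offset 0.
    start = committed.get(group, 0)
    for offset in range(start, len(topic_log)):
        landing_table.append((offset, topic_log[offset]))
    committed[group] = len(topic_log)
    return len(topic_log) - start


first_run = run_sink("connect-sink-a")  # initial load
restart = run_sink("connect-sink-a")    # same name: resumes after offset 4
renamed = run_sink("connect-sink-b")    # new name: new group, reads from 0 again

offsets = [offset for offset, _ in landing_table]
duplicate_rows = len(offsets) - len(set(offsets))
print(first_run, restart, renamed, duplicate_rows)  # 5 0 5 5
```

In this model, restarting under the same name writes nothing new, while the renamed connector writes every offset a second time into the same table.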

I would recommend:

  1. Upgrade to the newer version.
  2. See if you can repro this again.
  3. Send logs if you can.
  4. Confirm how you are checking for duplicate data on the Snowflake side, with the SQL query if possible.

Thanks

imrohankataria commented on September 22, 2024

sfc-gh-japatel commented on September 22, 2024

https://docs.snowflake.com/en/user-guide/kafka-connector-overview#schema-of-tables-for-kafka-topics

Here is what I use:

-- Per topic and partition: compare the row count, the distinct offsets, and the offset range.
select
    record_metadata:"topic"::string as TOPIC,
    record_metadata:"partition"::number as PARTITION_NO,
    count(distinct record_metadata:"offset"::number) as unique_offsets_per_partition,
    -- count_per_partitions > unique_offsets_per_partition means some offsets were ingested more than once
    count(record_metadata:"offset"::number) as count_per_partitions,
    min(record_metadata:"offset"::number) as min_offset,
    max(record_metadata:"offset"::number) as max_offset,
    (max_offset - min_offset + 1) as diff_min_max_offsets,
    -- a non-zero difference means the distinct offsets do not cover the whole min..max range
    case diff_min_max_offsets - unique_offsets_per_partition
        when 0 then 'Nothing Missing'
        else 'Missing Data/Duplicate Data'
    end as data_completeness
from <base_table>
group by TOPIC, PARTITION_NO
order by PARTITION_NO
;
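
To make the result easier to read, the same per-partition arithmetic can be sketched in Python with duplicates and gaps reported separately. In the query above, duplicates show up as `count_per_partitions` being larger than `unique_offsets_per_partition`, while `data_completeness` compares the offset range against the distinct offsets. The function name and the sample rows below are hypothetical; this is an illustration of the check, not part of the connector.

```python
from collections import defaultdict


def summarize_offsets(rows):
    """rows: iterable of (topic, partition, offset). Returns per-partition stats."""
    seen = defaultdict(list)
    for topic, partition, offset in rows:
        seen[(topic, partition)].append(offset)

    report = {}
    for key, offsets in sorted(seen.items()):
        unique = set(offsets)
        expected = max(unique) - min(unique) + 1  # same as diff_min_max_offsets
        report[key] = {
            "rows": len(offsets),                          # count_per_partitions
            "unique_offsets": len(unique),                 # unique_offsets_per_partition
            "duplicate_rows": len(offsets) - len(unique),  # offsets ingested more than once
            "missing_offsets": expected - len(unique),     # gaps inside min..max
        }
    return report


# Hypothetical sample: partition 0 has offset 1 twice, partition 1 skips offset 11.
rows = [
    ("orders", 0, 0), ("orders", 0, 1), ("orders", 0, 1), ("orders", 0, 2),
    ("orders", 1, 10), ("orders", 1, 12),
]
report = summarize_offsets(rows)
for key, stats in report.items():
    print(key, stats)
```

Note that a gap in offsets is not always lost data: compacted topics and transactional producers can leave offsets that never reach a consumer, so a non-zero `missing_offsets` is a prompt to investigate rather than proof of a problem.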

sfc-gh-japatel commented on September 22, 2024

Hi @imrohankataria, I'm closing this since there has been no activity. Feel free to open it again with more details on how you were able to reproduce the duplicate data. We would definitely treat that as a priority. Thanks.
